Regularizing an LSTM

Recurrent neural networks, in general, are hard to train, because of the problem of vanishing activations and gradients we saw before. Using LSTM (or GRU) cells makes training easier than with vanilla RNNs, but they are still very prone to overfitting. Data augmentation, while a possibility, is less often used for text data than for images because in most cases it requires another model to generate random augmentations (e.g., by translating the text into another language and then back into the original language). Overall, data augmentation for text data is currently not a well-explored space.

However, there are other regularization techniques we can use instead to reduce overfitting, which were thoroughly studied for use with LSTMs in the paper “Regularizing and Optimizing LSTM Language Models” by Stephen Merity, Nitish Shirish Keskar, and Richard Socher. This paper showed how effective use of dropout, activation regularization, and temporal activation regularization could allow an LSTM to beat state-of-the-art results that previously required much more complicated models. The authors called an LSTM using these techniques an AWD-LSTM. We’ll look at each of these techniques in turn.

Dropout

Dropout is a regularization technique that was introduced by Geoffrey Hinton et al. in Improving neural networks by preventing co-adaptation of feature detectors. The basic idea is to randomly change some activations to zero at training time. This makes sure all neurons actively work toward the output, as seen in <> (from “Dropout: A Simple Way to Prevent Neural Networks from Overfitting” by Nitish Srivastava et al.).

A figure from the article showing how neurons go off with dropout

Hinton used a nice metaphor when he explained, in an interview, the inspiration for dropout:

: I went to my bank. The tellers kept changing and I asked one of them why. He said he didn’t know but they got moved around a lot. I figured it must be because it would require cooperation between employees to successfully defraud the bank. This made me realize that randomly removing a different subset of neurons on each example would prevent conspiracies and thus reduce overfitting.

In the same interview, he also explained that neuroscience provided additional inspiration:

: We don’t really know why neurons spike. One theory is that they want to be noisy so as to regularize, because we have many more parameters than we have data points. The idea of dropout is that if you have noisy activations, you can afford to use a much bigger model.

This explains the idea behind why dropout helps to generalize: first, it forces the neurons to work well in many different, randomly chosen combinations rather than relying on any particular partners (preventing the "conspiracies" of Hinton's metaphor); second, it makes the activations noisier, thus making the model more robust.

We can see, however, that if we were to just zero those activations without doing anything else, our model would have problems training: if we go from the sum of five activations (which are all positive numbers, since we apply a ReLU) to just two, the result won't have the same scale. Therefore, if we apply dropout with a probability p, we rescale the remaining activations by dividing them by 1-p (on average a fraction p of them will be zeroed, so 1-p remain), as shown in <>.

A figure from the article introducing dropout showing how a neuron is on/off

This is a full implementation of the dropout layer in PyTorch (although PyTorch’s native layer is actually written in C, not Python):

In [ ]:

class Dropout(Module):
    def __init__(self, p): self.p = p
    def forward(self, x):
        if not self.training: return x
        # mask of ones (keep) and zeros (drop); survivors are rescaled by 1/(1-p)
        mask = x.new(*x.shape).bernoulli_(1-self.p)
        return x * mask.div_(1-self.p)

The bernoulli_ method creates a tensor of random zeros (with probability p) and ones (with probability 1-p), which is multiplied by our input and then divided by 1-p. Note the use of the training attribute, which is available in any PyTorch nn.Module and tells us whether we are doing training or inference.

note: Do Your Own Experiments: In previous chapters of the book we’d be adding a code example for bernoulli_ here, so you can see exactly how it works. But now that you know enough to do this yourself, we’re going to be doing fewer and fewer examples for you, and instead expecting you to do your own experiments to see how things work. In this case, you’ll see in the end-of-chapter questionnaire that we’re asking you to experiment with bernoulli_—but don’t wait for us to ask you to experiment to develop your understanding of the code we’re studying; go ahead and do it anyway!

Using dropout before passing the output of our LSTM to the final layer will help reduce overfitting. Dropout is also used in many other models, including the default CNN head used in fastai.vision, and is available in fastai.tabular by passing the ps parameter (where each “p” is passed to each added Dropout layer), as we’ll see in <>.

Dropout has different behavior in training and validation mode, which we specified using the training attribute in Dropout. Calling the train method on a Module sets training to True (both for the module you call the method on and for every module it recursively contains), and eval sets it to False. This is done automatically when calling the methods of Learner, but if you are not using that class, remember to switch from one to the other as needed.
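For instance, here is a small experiment you can run yourself with PyTorch's built-in nn.Dropout (which behaves like our Dropout module above): in training mode, roughly a fraction p of the values are zeroed and the survivors are scaled by 1/(1-p); in evaluation mode, the layer does nothing.

In [ ]:

import torch
from torch import nn

x = torch.ones(8)
dp = nn.Dropout(p=0.5)

dp.train()    # training mode: roughly half the values are zeroed...
print(dp(x))  # ...and the survivors are scaled up to 1/(1-p) = 2.0

dp.eval()     # evaluation mode: dropout does nothing
print(dp(x))  # all ones, unchanged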

Activation Regularization and Temporal Activation Regularization

Activation regularization (AR) and temporal activation regularization (TAR) are two regularization methods very similar to weight decay, discussed in <>. When applying weight decay, we add a small penalty to the loss that aims at making the weights as small as possible. For activation regularization, it’s the final activations produced by the LSTM that we will try to make as small as possible, instead of the weights.

To regularize the final activations, we have to store them somewhere, then add the mean of their squares to the loss (along with a multiplier alpha, which is just like wd for weight decay):

loss += alpha * activations.pow(2).mean()

Temporal activation regularization is linked to the fact that we are predicting tokens in a sentence, which means the outputs of our LSTM should make some sense when we read them in order. TAR encourages that behavior by adding a penalty to the loss that makes the difference between two consecutive activations as small as possible: our activations tensor has a shape of bs x sl x n_hid, and we read consecutive activations along the sequence length axis (the middle dimension). With this, TAR can be expressed as:

loss += beta * (activations[:,1:] - activations[:,:-1]).pow(2).mean()

alpha and beta are then two hyperparameters to tune. To make this work, we need our model with dropout to return three things: the proper output, the activations of the LSTM pre-dropout, and the activations of the LSTM post-dropout. AR is often applied on the dropped-out activations (to not penalize the activations we turned into zeros afterward) while TAR is applied on the non-dropped-out activations (because those zeros create big differences between two consecutive time steps). There is then a callback called RNNRegularizer that will apply this regularization for us.
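To make the picture concrete, here is a rough sketch of what such a callback has to compute (this is only an illustration, not fastai's actual RNNRegularizer implementation, and the function name is made up), given the raw and dropped-out activations returned by the model:

In [ ]:

def add_rnn_regularization(loss, raw_out, dropped_out, alpha=2., beta=1.):
    # AR: penalize large activations, computed on the dropped-out output
    if alpha != 0.:
        loss = loss + alpha * dropped_out.float().pow(2).mean()
    # TAR: penalize large jumps between consecutive time steps, computed on
    # the raw (non-dropped-out) output of shape (bs, sl, n_hid)
    if beta != 0. and raw_out.shape[1] > 1:
        diffs = raw_out[:, 1:] - raw_out[:, :-1]
        loss = loss + beta * diffs.float().pow(2).mean()
    return loss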

Training a Weight-Tied Regularized LSTM

We can combine dropout (applied before we go into our output layer) with AR and TAR to train our previous LSTM. We just need to return three things instead of one: the normal output of our LSTM, the raw activations of the LSTM, and the dropped-out activations. The last two will be picked up by the RNNRegularizer callback for the contributions it has to make to the loss.

Another useful trick we can add from the AWD LSTM paper is weight tying. In a language model, the input embeddings represent a mapping from English words to activations, and the output hidden layer represents a mapping from activations to English words. We might expect, intuitively, that these mappings could be the same. We can represent this in PyTorch by assigning the same weight matrix to each of these layers:

self.h_o.weight = self.i_h.weight
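One thing worth checking for yourself is that this assignment is even possible: nn.Embedding(vocab_sz, n_hidden) stores its weight as a vocab_sz x n_hidden matrix, and nn.Linear stores its weight as out_features x in_features, so nn.Linear(n_hidden, vocab_sz) has a weight of exactly the same shape. A quick sketch (with illustrative sizes):

In [ ]:

from torch import nn

vocab_sz, n_hidden = 30, 64                 # illustrative sizes
i_h = nn.Embedding(vocab_sz, n_hidden)
h_o = nn.Linear(n_hidden, vocab_sz)
print(i_h.weight.shape, h_o.weight.shape)   # both torch.Size([30, 64])
h_o.weight = i_h.weight                     # the two layers now share one parameter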

In LMModel7, we include these final tweaks:

In [ ]:

class LMModel7(Module):
    def __init__(self, vocab_sz, n_hidden, n_layers, p):
        self.i_h = nn.Embedding(vocab_sz, n_hidden)
        self.rnn = nn.LSTM(n_hidden, n_hidden, n_layers, batch_first=True)
        self.drop = nn.Dropout(p)
        self.h_o = nn.Linear(n_hidden, vocab_sz)
        self.h_o.weight = self.i_h.weight
        self.h = [torch.zeros(n_layers, bs, n_hidden) for _ in range(2)]

    def forward(self, x):
        raw,h = self.rnn(self.i_h(x), self.h)
        out = self.drop(raw)
        self.h = [h_.detach() for h_ in h]
        return self.h_o(out),raw,out

    def reset(self):
        for h in self.h: h.zero_()

We can create a regularized Learner using the RNNRegularizer callback:

In [ ]:

learn = Learner(dls, LMModel7(len(vocab), 64, 2, 0.5),
                loss_func=CrossEntropyLossFlat(), metrics=accuracy,
                cbs=[ModelResetter, RNNRegularizer(alpha=2, beta=1)])

A TextLearner automatically adds those two callbacks for us (with those values for alpha and beta as defaults), so we can simplify the preceding line to:

In [ ]:

learn = TextLearner(dls, LMModel7(len(vocab), 64, 2, 0.4),
                    loss_func=CrossEntropyLossFlat(), metrics=accuracy)

We can then train the model, and add additional regularization by increasing the weight decay to 0.1:

In [ ]:

learn.fit_one_cycle(15, 1e-2, wd=0.1)
epoch   train_loss   valid_loss   accuracy   time
0       2.693885     2.013484     0.466634   00:02
1       1.685549     1.187310     0.629313   00:02
2       0.973307     0.791398     0.745605   00:02
3       0.555823     0.640412     0.794108   00:02
4       0.351802     0.557247     0.836100   00:02
5       0.244986     0.594977     0.807292   00:02
6       0.192231     0.511690     0.846761   00:02
7       0.162456     0.520370     0.858073   00:02
8       0.142664     0.525918     0.842285   00:02
9       0.128493     0.495029     0.858073   00:02
10      0.117589     0.464236     0.867188   00:02
11      0.109808     0.466550     0.869303   00:02
12      0.104216     0.455151     0.871826   00:02
13      0.100271     0.452659     0.873617   00:02
14      0.098121     0.458372     0.869385   00:02

Now this is far better than our previous model!