Natural Language Processing

Converting an AWD-LSTM language model into a transfer learning classifier, as we did earlier in the book, follows a very similar process to what we did with cnn_learner in the first section of this chapter. We do not need a “meta” dictionary in this case, because we do not have such a variety of architectures to support in the body. All we need to do is select the stacked RNN for the encoder in the language model, which is a single PyTorch module. This encoder will provide an activation for every word of the input, because a language model needs to output a prediction for every next word.
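In fastai this hand-off takes only a couple of method calls. The sketch below is one possible setup, not the book's exact code: dls_lm and dls_clas are assumed to be text DataLoaders you have already built (for instance with TextDataLoaders or the DataBlock API) that share the same vocabulary.

```python
from fastai.text.all import *

# Fine-tune the language model; learn_lm.model[0] is the stacked AWD-LSTM encoder.
learn_lm = language_model_learner(dls_lm, AWD_LSTM, metrics=accuracy)
learn_lm.fine_tune(1)
learn_lm.save_encoder('finetuned')        # saves only the encoder, not the LM head

# Build the classifier and load the fine-tuned encoder into its body.
learn_clas = text_classifier_learner(dls_clas, AWD_LSTM, metrics=accuracy)
learn_clas.load_encoder('finetuned')
learn_clas.fine_tune(1)
```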

To create a classifier from this we use an approach described in the ULMFiT paper as “BPTT for Text Classification (BPT3C)”:

: We divide the document into fixed-length batches of size b. At the beginning of each batch, the model is initialized with the final state of the previous batch; we keep track of the hidden states for mean and max-pooling; gradients are back-propagated to the batches whose hidden states contributed to the final prediction. In practice, we use variable length backpropagation sequences.

In other words, the classifier contains a for loop that iterates over each batch of a sequence. The state is maintained across batches, and the activations of each batch are stored. At the end, we apply the same concatenated average and max pooling trick used for computer vision models, except that this time we pool over RNN sequences rather than CNN grid cells.
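Below is a minimal illustrative sketch of that loop in plain PyTorch. It is not fastai's actual implementation: the encoder's reset() method, its (batch, length, hidden) output shape, the pad_id argument, and padding sitting at the end of each sequence are all assumptions made for the example, and the gradient truncation across chunks described in the paper is omitted for brevity.

```python
import torch

def bpt3c_pool(encoder, input_ids, pad_id, bptt=72):
    "Sketch of a BPT3C-style forward pass: chunked encoding plus concatenated pooling."
    encoder.reset()                            # assumed hook that zeroes the RNN hidden state
    outs = []
    for i in range(0, input_ids.shape[1], bptt):
        # The hidden state lives inside `encoder`, so it carries over between chunks.
        outs.append(encoder(input_ids[:, i:i+bptt]))   # assumed shape (batch, chunk_len, hidden)
    out = torch.cat(outs, dim=1)               # one activation per input token

    mask = input_ids == pad_id                 # True at padded positions
    lens = (~mask).sum(dim=1, keepdim=True)    # real (non-pad) length of each text
    avg  = (out * (~mask).unsqueeze(-1)).sum(dim=1) / lens                        # mean pool, ignoring padding
    mx   = out.masked_fill(mask.unsqueeze(-1), float('-inf')).max(dim=1).values   # max pool, ignoring padding
    last = out[torch.arange(out.shape[0]), lens.squeeze(1) - 1]                   # last non-pad activation
    return torch.cat([last, mx, avg], dim=1)   # concatenated pooling, as in ULMFiT
```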

For this for loop, we need to gather our data in batches, but each text needs to be treated separately, as they each have their own labels. However, it’s very likely that those texts won’t all be the same length, which means we won’t be able to put them all in the same array, as we did with the language model.

That’s where padding is going to help: when grabbing a batch of texts, we find the longest one and extend the shorter ones with a special padding token called xxpad until they all match its length. To avoid extreme cases where a text with 2,000 tokens ends up in the same batch as a text with 10 tokens (which would mean a lot of padding and a lot of wasted computation), we alter the randomness by making sure texts of comparable size are put together. The texts will still be in a somewhat random order for the training set (for the validation set we can simply sort them by length), but not completely so.

This is done automatically behind the scenes by the fastai library when creating our DataLoaders.
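Conceptually, what happens is close to the following hedged sketch; the helper name sorted_pad_batches and the pad_id value are invented for illustration, and texts is assumed to be a list of 1-D tensors of token ids.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

def sorted_pad_batches(texts, labels, bs=64, pad_id=1, shuffle=True):
    "Group texts of similar length, then pad each batch to its own longest text."
    order = sorted(range(len(texts)), key=lambda i: len(texts[i]))   # sort indices by text length
    batches = [order[i:i+bs] for i in range(0, len(order), bs)]
    if shuffle:                                                      # keep some randomness for training
        batches = [batches[i] for i in torch.randperm(len(batches)).tolist()]
    for idxs in batches:
        xb = pad_sequence([texts[i] for i in idxs],
                          batch_first=True, padding_value=pad_id)    # pad with xxpad's id
        yb = torch.tensor([labels[i] for i in idxs])
        yield xb, yb
```

With fastai you never write anything like this yourself: padding and approximate length-sorting (exact sorting for the validation set) are applied when the text DataLoaders are built.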