Our First Language Model from Scratch

One simple way to turn this into a neural network would be to specify that we are going to predict each word based on the previous three words. We could create a list of every sequence of three words as our independent variables, and the next word after each sequence as the dependent variable.

We can do that with plain Python. Let’s do it first with tokens just to confirm what it looks like:

In [ ]:

L((tokens[i:i+3], tokens[i+3]) for i in range(0,len(tokens)-4,3))

Out[ ]:

(#21031) [(['one', '.', 'two'], '.'),(['.', 'three', '.'], 'four'),(['four', '.', 'five'], '.'),(['.', 'six', '.'], 'seven'),(['seven', '.', 'eight'], '.'),(['.', 'nine', '.'], 'ten'),(['ten', '.', 'eleven'], '.'),(['.', 'twelve', '.'], 'thirteen'),(['thirteen', '.', 'fourteen'], '.'),(['.', 'fifteen', '.'], 'sixteen')...]

Now we will do it with tensors of the numericalized values, which is what the model will actually use:

In [ ]:

seqs = L((tensor(nums[i:i+3]), nums[i+3]) for i in range(0,len(nums)-4,3))
seqs

Out[ ]:

(#21031) [(tensor([0, 1, 2]), 1),(tensor([1, 3, 1]), 4),(tensor([4, 1, 5]), 1),(tensor([1, 6, 1]), 7),(tensor([7, 1, 8]), 1),(tensor([1, 9, 1]), 10),(tensor([10, 1, 11]), 1),(tensor([ 1, 12, 1]), 13),(tensor([13, 1, 14]), 1),(tensor([ 1, 15, 1]), 16)...]
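
If you want to convince yourself that the two views line up, you can decode the first numericalized sequence back into tokens. This is just a quick sketch, assuming the vocab list built earlier in the chapter:

  x, y = seqs[0]
  print([vocab[int(i)] for i in x], vocab[y])   # ['one', '.', 'two'] .  (matches the first pair above)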

We can batch those easily using fastai's DataLoaders class. For now, we'll simply take the first 80% of the sequences as the training set and the remaining 20% as the validation set:

In [ ]:

bs = 64
cut = int(len(seqs) * 0.8)
dls = DataLoaders.from_dsets(seqs[:cut], seqs[cut:], bs=bs, shuffle=False)
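
To confirm what the model will actually receive, we can peek at one batch (a quick sketch, using fastcore's first helper that the fastai import provides): the three-token inputs are stacked into a batch of token indices, and the targets into a vector of next-token indices.

  x, y = first(dls.train)    # one batch from the training DataLoader
  print(x.shape, y.shape)    # e.g. torch.Size([64, 3]) torch.Size([64])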

We can now create a neural network architecture that takes three words as input, and returns a prediction of the probability of each possible next word in the vocab. We will use three standard linear layers, but with two tweaks.

The first tweak is that the first linear layer will use only the first word’s embedding as activations, the second layer will use the second word’s embedding plus the first layer’s output activations, and the third layer will use the third word’s embedding plus the second layer’s output activations. The key effect of this is that every word is interpreted in the information context of any words preceding it.

The second tweak is that each of these three layers will use the same weight matrix. The way that one word impacts the activations from previous words should not change depending on the position of a word. In other words, activation values will change as data moves through the layers, but the layer weights themselves will not change from layer to layer. So, a layer does not learn one sequence position; it must learn to handle all positions.

Since layer weights do not change, you might think of the sequential layers as “the same layer” repeated. In fact, PyTorch makes this concrete; we can just create one layer, and use it multiple times.
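
To see what this means in plain PyTorch, here is a minimal standalone sketch (not part of the model we are about to build): one nn.Linear layer applied twice, so both applications share the identical weight matrix and their gradients accumulate into the same parameters.

  import torch
  import torch.nn as nn

  layer = nn.Linear(4, 4)          # a single weight matrix and bias
  x = torch.randn(2, 4)

  h = torch.relu(layer(x))         # "first layer"
  h = torch.relu(layer(h))         # "second layer", reusing the very same weights
  h.sum().backward()

  print(len(list(layer.parameters())))  # 2: just the one shared weight and bias
  print(layer.weight.grad.shape)        # gradients from both uses accumulate here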

Our Language Model in PyTorch

We can now create the language model module that we described earlier:

In [ ]:

class LMModel1(Module):
    def __init__(self, vocab_sz, n_hidden):
        self.i_h = nn.Embedding(vocab_sz, n_hidden)
        self.h_h = nn.Linear(n_hidden, n_hidden)
        self.h_o = nn.Linear(n_hidden, vocab_sz)

    def forward(self, x):
        h = F.relu(self.h_h(self.i_h(x[:,0])))
        h = h + self.i_h(x[:,1])
        h = F.relu(self.h_h(h))
        h = h + self.i_h(x[:,2])
        h = F.relu(self.h_h(h))
        return self.h_o(h)

As you see, we have created three layers:

  • The embedding layer (i_h, for input to hidden)
  • The linear layer to create the activations for the next word (h_h, for hidden to hidden)
  • A final linear layer to predict the fourth word (h_o, for hidden to output)

This might be easier to represent in pictorial form, so let’s define a simple pictorial representation of basic neural networks. <> shows how we’re going to represent a neural net with one hidden layer.

Pictorial representation of simple neural network

Each shape represents activations: rectangle for input, circle for hidden (inner) layer activations, and triangle for output activations. We will use those shapes (summarized in <>) in all the diagrams in this chapter.

Shapes used in our pictorial representations

An arrow represents the actual layer computation—i.e., the linear layer followed by the activation function. Using this notation, <> shows what our simple language model looks like.

Representation of our basic language model

To simplify things, we’ve removed the details of the layer computation from each arrow. We’ve also color-coded the arrows, such that all arrows with the same color have the same weight matrix. For instance, all the input layers use the same embedding matrix, so they all have the same color (green).
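
Before training, a quick sanity check confirms that the model maps a batch of three token indices to one score per item in the vocab. This is just a sketch, assuming the vocab, dls, and LMModel1 defined above:

  model = LMModel1(len(vocab), 64)
  x, y = first(dls.train)       # one batch of (three-token inputs, next-token targets)
  preds = model(x)
  print(x.shape, preds.shape)   # (bs, 3) indices in, (bs, len(vocab)) scores out

Those raw scores are what F.cross_entropy will compare against the target indices during training.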

Let’s try training this model and see how it goes:

In [ ]:

learn = Learner(dls, LMModel1(len(vocab), 64), loss_func=F.cross_entropy,
                metrics=accuracy)
learn.fit_one_cycle(4, 1e-3)

epoch  train_loss  valid_loss  accuracy  time
0      1.824297    1.970941    0.467554  00:02
1      1.386973    1.823242    0.467554  00:02
2      1.417556    1.654497    0.494414  00:02
3      1.376440    1.650849    0.494414  00:02

To see if this is any good, let’s check what a very simple model would give us. In this case we could always predict the most common token, so let’s find out which token is most often the target in our validation set:

In [ ]:

n,counts = 0,torch.zeros(len(vocab))
for x,y in dls.valid:
    n += y.shape[0]
    for i in range_of(vocab): counts[i] += (y==i).long().sum()
idx = torch.argmax(counts)
idx, vocab[idx.item()], counts[idx].item()/n

Out[ ]:

(tensor(29), 'thousand', 0.15165200855716662)

The most common token has the index 29, which corresponds to the token thousand. Always predicting this token would give us an accuracy of roughly 15%, so our model, at roughly 49%, is faring far better!

A: My first guess was that the separator would be the most common token, since there is one for every number. But looking at tokens reminded me that large numbers are written with many words, so on the way to 10,000 you write “thousand” a lot: five thousand, five thousand and one, five thousand and two, etc. Oops! Looking at your data is great for noticing subtle features and also embarrassingly obvious ones.

This is a nice first baseline. Let’s see how we can refactor it with a loop.

Our First Recurrent Neural Network

Looking at the code for our module, we could simplify it by replacing the duplicated code that calls the layers with a for loop. As well as making our code simpler, this will also have the benefit that we will be able to apply our module equally well to token sequences of different lengths—we won’t be restricted to token lists of length three:

In [ ]:

class LMModel2(Module):
    def __init__(self, vocab_sz, n_hidden):
        self.i_h = nn.Embedding(vocab_sz, n_hidden)
        self.h_h = nn.Linear(n_hidden, n_hidden)
        self.h_o = nn.Linear(n_hidden, vocab_sz)

    def forward(self, x):
        h = 0
        for i in range(3):
            h = h + self.i_h(x[:,i])
            h = F.relu(self.h_h(h))
        return self.h_o(h)

Let’s check that we get the same results using this refactoring:

In [ ]:

learn = Learner(dls, LMModel2(len(vocab), 64), loss_func=F.cross_entropy,
                metrics=accuracy)
learn.fit_one_cycle(4, 1e-3)

epoch  train_loss  valid_loss  accuracy  time
0      1.816274    1.964143    0.460185  00:02
1      1.423805    1.739964    0.473259  00:02
2      1.430327    1.685172    0.485382  00:02
3      1.388390    1.657033    0.470406  00:02
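
As noted above, the loop means nothing in this module depends on the inputs having exactly three tokens. A hypothetical variant (a sketch, not used elsewhere in this chapter) only has to read the sequence length from the batch itself:

  class LMModel2Any(Module):
      "Hypothetical variant of LMModel2 that accepts any sequence length."
      def __init__(self, vocab_sz, n_hidden):
          self.i_h = nn.Embedding(vocab_sz, n_hidden)
          self.h_h = nn.Linear(n_hidden, n_hidden)
          self.h_o = nn.Linear(n_hidden, vocab_sz)

      def forward(self, x):
          h = 0
          for i in range(x.shape[1]):   # sequence length taken from the batch
              h = h + self.i_h(x[:,i])
              h = F.relu(self.h_h(h))
          return self.h_o(h)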

We can also refactor our pictorial representation in exactly the same way, as shown in <> (we’re also removing the details of activation sizes here, and using the same arrow colors as in <>).

Basic recurrent neural network

You will see that there is a set of activations that are being updated each time through the loop, stored in the variable h—this is called the hidden state.

Jargon: hidden state: The activations that are updated at each step of a recurrent neural network.

A neural network that is defined using a loop like this is called a recurrent neural network (RNN). It is important to realize that an RNN is not a complicated new architecture, but simply a refactoring of a multilayer neural network using a for loop.
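
Written out with the layer names from LMModel2, each pass through the loop computes h = F.relu(h_h(h + i_h(x[:,i]))), starting from h = 0, and the final prediction is h_o(h). Unroll that loop three times and you get back exactly the stack of layers in LMModel1.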

A: My true opinion: if they were called “looping neural networks,” or LNNs, they would seem 50% less daunting!

Now that we know what an RNN is, let’s try to make it a little bit better.