The Data

Whenever we start working on a new problem, we always first try to think of the simplest dataset we can that will allow us to try out methods quickly and easily, and interpret the results. When we started working on language modeling a few years ago we didn’t find any datasets that would allow for quick prototyping, so we made one. We call it Human Numbers, and it simply contains the first 10,000 numbers written out in English.
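
If you ever need a similar toy dataset of your own, it takes only a few lines to generate one. The following is just a sketch, not the code used to build the fastai dataset: it assumes the third-party num2words package, whose spelling (for instance, the hyphen in "twenty-one") differs slightly from the HUMAN_NUMBERS files.

    # Rough sketch (not the fastai code) for generating a comparable dataset.
    # Assumes the third-party `num2words` package is installed; its formatting
    # differs slightly from the files downloaded below.
    from num2words import num2words

    with open('my_numbers.txt', 'w') as f:
        for i in range(1, 10001):
            f.write(num2words(i) + ' \n')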

j: One of the most common practical mistakes I see, even amongst highly experienced practitioners, is failing to use appropriate datasets at appropriate times during the analysis process. In particular, most people tend to start with datasets that are too big and too complicated.

We can download, extract, and take a look at our dataset in the usual way:

In [ ]:

    from fastai.text.all import *
    path = untar_data(URLs.HUMAN_NUMBERS)

In [ ]:

    #hide
    Path.BASE_PATH = path

In [ ]:

    path.ls()

Out[ ]:

    (#2) [Path('train.txt'),Path('valid.txt')]

Let’s open those two files and see what’s inside. At first we’ll join all of the texts together and ignore the train/valid split given by the dataset (we’ll come back to that later):

In [ ]:

    lines = L()
    with open(path/'train.txt') as f: lines += L(*f.readlines())
    with open(path/'valid.txt') as f: lines += L(*f.readlines())
    lines

Out[ ]:

    (#9998) ['one \n','two \n','three \n','four \n','five \n','six \n','seven \n','eight \n','nine \n','ten \n'...]

We take all those lines and concatenate them into one big stream. To mark when we go from one number to the next, we use a '.' as a separator:

In [ ]:

    text = ' . '.join([l.strip() for l in lines])
    text[:100]

Out[ ]:

    'one . two . three . four . five . six . seven . eight . nine . ten . eleven . twelve . thirteen . fo'

We can tokenize this dataset by splitting on spaces:

In [ ]:

    tokens = text.split(' ')
    tokens[:10]

Out[ ]:

    ['one', '.', 'two', '.', 'three', '.', 'four', '.', 'five', '.']

To numericalize, we have to create a list of all the unique tokens (our vocab):

In [ ]:

    vocab = L(*tokens).unique()
    vocab

Out[ ]:

    (#30) ['one','.','two','three','four','five','six','seven','eight','nine'...]

Then we can convert our tokens into numbers by looking up the index of each in the vocab:

In [ ]:

    word2idx = {w:i for i,w in enumerate(vocab)}
    nums = L(word2idx[i] for i in tokens)
    nums

Out[ ]:

    (#63095) [0,1,2,1,3,1,4,1,5,1...]
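
As a quick sanity check (not part of the original notebook), we can map those integers back through the vocab and confirm that we recover the original token stream exactly:

    # Decode the integer ids back into tokens and verify the round trip.
    decoded = [vocab[i] for i in nums]
    assert decoded == tokens
    ' '.join(decoded[:20])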

Now that we have a small dataset on which language modeling should be an easy task, we can build our first model.