Adding a Nonlinearity

So far we have a general procedure for optimizing the parameters of a function, and we have tried it out on a very boring function: a simple linear classifier. A linear classifier is very constrained in terms of what it can do. To make it a bit more complex (and able to handle more tasks), we need to add something nonlinear between two linear classifiers—this is what gives us a neural network.

Here is the entire definition of a basic neural network:

In [ ]:

def simple_net(xb):
    res = xb@w1 + b1
    res = res.max(tensor(0.0))
    res = res@w2 + b2
    return res

That’s it! All we have in simple_net is two linear classifiers with a max function between them.

Here, w1 and w2 are weight tensors, and b1 and b2 are bias tensors; that is, parameters that are randomly initialized, just like we did in the previous section:

In [ ]:

w1 = init_params((28*28,30))
b1 = init_params(30)
w2 = init_params((30,1))
b2 = init_params(1)
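
As a reminder, init_params comes from the previous section; a rough sketch of it (see that section for the exact code) looks something like this:

def init_params(size, std=1.0):
    # random values, scaled by std, tracked for gradient computation
    return (torch.randn(size)*std).requires_grad_()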

The key point about this is that w1 has 30 output activations (which means that w2 must have 30 input activations, so they match). That means that the first layer can construct 30 different features, each representing some different mix of pixels. You can change that 30 to anything you like, to make the model more or less complex.
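
To see those shapes concretely, here is a quick sanity check (not from the book, assuming the definitions above are in scope): pass a fake batch through simple_net and inspect the hidden and output activations.

xb = torch.randn(64, 28*28)     # a fake batch of 64 flattened images
hidden = xb@w1 + b1             # first layer: 30 activations per image
print(hidden.shape)             # torch.Size([64, 30])
print(simple_net(xb).shape)     # torch.Size([64, 1])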

That little function res.max(tensor(0.0)) is called a rectified linear unit, also known as ReLU. We think we can all agree that rectified linear unit sounds pretty fancy and complicated… But actually, there’s nothing more to it than res.max(tensor(0.0))—in other words, replace every negative number with a zero. This tiny function is also available in PyTorch as F.relu:

In [ ]:

plot_function(F.relu)

Adding a Nonlinearity - Figure 1: a plot of the ReLU function (negative inputs mapped to zero, positive inputs unchanged)
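
If you want to convince yourself that the two spellings really are the same thing, here is a tiny check (not from the book):

t = tensor([-2., -0.5, 0., 1.5, 3.])
print(t.max(tensor(0.0)))   # negatives become zero: [0., 0., 0., 1.5, 3.]
print(F.relu(t))            # identical result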

J: There is an enormous amount of jargon in deep learning, including terms like rectified linear unit. The vast vast majority of this jargon is no more complicated than can be implemented in a short line of code, as we saw in this example. The reality is that for academics to get their papers published they need to make them sound as impressive and sophisticated as possible. One of the ways that they do that is to introduce jargon. Unfortunately, this has the result that the field ends up becoming far more intimidating and difficult to get into than it should be. You do have to learn the jargon, because otherwise papers and tutorials are not going to mean much to you. But that doesn’t mean you have to find the jargon intimidating. Just remember, when you come across a word or phrase that you haven’t seen before, it will almost certainly turn out to be referring to a very simple concept.

The basic idea is that by using more linear layers, we can have our model do more computation, and therefore model more complex functions. But there’s no point just putting one linear layer directly after another one, because when we multiply things together and then add them up multiple times, that could be replaced by multiplying different things together and adding them up just once! That is to say, a series of any number of linear layers in a row can be replaced with a single linear layer with a different set of parameters.

But if we put a nonlinear function between them, such as max, then this is no longer true. Now each linear layer is actually somewhat decoupled from the other ones, and can do its own useful work. The max function is particularly interesting, because it operates as a simple if statement.

S: Mathematically, we say the composition of two linear functions is another linear function. So, we can stack as many linear classifiers as we want on top of each other, and without nonlinear functions between them, it will just be the same as one linear classifier.
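
Here is a small numeric demonstration of that point (not from the book; the names below are local to this snippet): two linear layers with nothing between them collapse into a single linear layer with different parameters.

x = torch.randn(5, 10)
A, a = torch.randn(10, 8), torch.randn(8)   # first linear layer
B, b = torch.randn(8, 3),  torch.randn(3)   # second linear layer

two_layers = (x@A + a)@B + b
one_layer  = x@(A@B) + (a@B + b)            # one linear layer, different parameters
print(torch.allclose(two_layers, one_layer, atol=1e-5))  # True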

Amazingly enough, it can be mathematically proven that this little function can solve any computable problem to an arbitrarily high level of accuracy, if you can find the right parameters for w1 and w2 and if you make these matrices big enough. For any arbitrarily wiggly function, we can approximate it as a bunch of lines joined together; to make it closer to the wiggly function, we just have to use shorter lines. This is known as the universal approximation theorem. The three lines of code that we have here are known as layers. The first and third are known as linear layers, and the second line of code is known variously as a nonlinearity, or activation function.
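
To get a feel for this, here is a hedged sketch (not from the book) that fits a tiny one-hidden-layer ReLU network to a wiggly one-dimensional function; the learning rate and number of steps are arbitrary choices and may need tuning.

import torch
from torch import nn

torch.manual_seed(0)
x = torch.linspace(-3, 3, 200).unsqueeze(1)
y = torch.sin(2*x)                                  # the "wiggly" target

net = nn.Sequential(nn.Linear(1, 50), nn.ReLU(), nn.Linear(50, 1))
opt = torch.optim.SGD(net.parameters(), lr=0.05)

for _ in range(2000):                               # plain full-batch SGD on squared error
    loss = ((net(x) - y)**2).mean()
    loss.backward()
    opt.step()
    opt.zero_grad()

print(loss.item())   # the fitted curve is a chain of short line segments approximating the target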

Just like in the previous section, we can replace this code with something a bit simpler, by taking advantage of PyTorch:

In [ ]:

simple_net = nn.Sequential(
    nn.Linear(28*28,30),
    nn.ReLU(),
    nn.Linear(30,1)
)

nn.Sequential creates a module that will call each of the listed layers or functions in turn.

nn.ReLU is a PyTorch module that does exactly the same thing as the F.relu function. Most functions that can appear in a model also have identical forms that are modules. Generally, it’s just a case of replacing F with nn and changing the capitalization. When using nn.Sequential, PyTorch requires us to use the module version. Since modules are classes, we have to instantiate them, which is why you see nn.ReLU() in this example.
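
A quick check of both claims (not from the book): calling the Sequential's layers by hand, one after another, gives the same result as calling the model, and nn.ReLU() matches F.relu.

xb = torch.randn(16, 28*28)
res = xb
for layer in simple_net:                         # call each layer in turn
    res = layer(res)
print(torch.allclose(res, simple_net(xb)))       # True
print(torch.equal(nn.ReLU()(xb), F.relu(xb)))    # True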

Because nn.Sequential is a module, we can get its parameters, which will return a list of all the parameters of all the modules it contains. Let’s try it out! As this is a deeper model, we’ll use a lower learning rate and a few more epochs.
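
Before training, here is a quick look at those parameters (not from the book): iterating over them shows the weight and bias tensors of both linear layers.

for p in simple_net.parameters():
    print(p.shape)
# torch.Size([30, 784]), torch.Size([30]), torch.Size([1, 30]), torch.Size([1])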

In [ ]:

learn = Learner(dls, simple_net, opt_func=SGD,
                loss_func=mnist_loss, metrics=batch_accuracy)

In [ ]:

#hide_output
learn.fit(40, 0.1)

We’re not showing the 40 lines of output here to save room; the training process is recorded in learn.recorder, with the table of output stored in the values attribute, so we can plot the accuracy over training as:

In [ ]:

plt.plot(L(learn.recorder.values).itemgot(2));

Adding a Nonlinearity - Figure 2: batch accuracy plotted over the 40 training epochs
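
As an aside (not from the book), the same recorder values also hold the losses; assuming the columns are ordered train_loss, valid_loss, batch_accuracy, the validation loss can be plotted the same way:

plt.plot(L(learn.recorder.values).itemgot(1));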

And we can view the final accuracy:

In [ ]:

learn.recorder.values[-1][2]

Out[ ]:

0.982826292514801

At this point we have something that is rather magical:

  1. A function that can solve any problem to any level of accuracy (the neural network) given the correct set of parameters
  2. A way to find the best set of parameters for any function (stochastic gradient descent)

This is why deep learning can do things that seem rather magical. Believing that this combination of simple techniques can really solve any problem is one of the biggest steps that we find many students have to take. It seems too good to be true—surely things should be more difficult and complicated than this? Our recommendation: try it out! We just tried it on the MNIST dataset and you have seen the results. And since we are doing everything from scratch ourselves (except for calculating the gradients), you know that there is no special magic hiding behind the scenes.

Going Deeper

There is no need to stop at just two linear layers. We can add as many as we want, as long as we add a nonlinearity between each pair of linear layers. As you will learn, however, the deeper the model gets, the harder it is to optimize the parameters in practice. Later in this book you will learn about some simple but brilliantly effective techniques for training deeper models.
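
For instance, a version of simple_net with one more linear layer might look like this (a sketch, not from the book; the layer sizes are arbitrary):

deeper_net = nn.Sequential(
    nn.Linear(28*28, 30), nn.ReLU(),
    nn.Linear(30, 30),    nn.ReLU(),
    nn.Linear(30, 1)
)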

We already know that a single nonlinearity with two linear layers is enough to approximate any function. So why would we use deeper models? The reason is performance. With a deeper model (that is, one with more layers) we do not need to use as many parameters; it turns out that we can use smaller matrices with more layers, and get better results than we would get with larger matrices and fewer layers.
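
A rough illustration of that trade-off (not from the book; the sizes are made up): a narrow model with two hidden layers can have far fewer parameters than a very wide model with one.

def n_params(m): return sum(p.numel() for p in m.parameters())

wide = nn.Sequential(nn.Linear(28*28, 1000), nn.ReLU(), nn.Linear(1000, 1))
deep = nn.Sequential(nn.Linear(28*28, 100), nn.ReLU(),
                     nn.Linear(100, 100),   nn.ReLU(),
                     nn.Linear(100, 1))
print(n_params(wide), n_params(deep))   # 786001 vs 88701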

That means that we can train the model more quickly, and it will take up less memory. In the 1990s researchers were so focused on the universal approximation theorem that very few were experimenting with more than one nonlinearity. This theoretical but not practical foundation held back the field for years. Some researchers, however, did experiment with deep models, and eventually were able to show that these models could perform much better in practice. Eventually, theoretical results were developed which showed why this happens. Today, it is extremely unusual to find anybody using a neural network with just one nonlinearity.

Here is what happens when we train an 18-layer model using the same approach we saw in <>:

In [ ]:

dls = ImageDataLoaders.from_folder(path)
learn = cnn_learner(dls, resnet18, pretrained=False,
                    loss_func=F.cross_entropy, metrics=accuracy)
learn.fit_one_cycle(1, 0.1)
epoch | train_loss | valid_loss | accuracy | time
------|------------|------------|----------|------
0     | 0.082089   | 0.009578   | 0.997056 | 00:11

Nearly 100% accuracy! That’s a big difference compared to our simple neural net. But as you’ll learn in the remainder of this book, there are just a few little tricks you need to use to get such great results from scratch yourself. You already know the key foundational pieces. (Of course, even once you know all the tricks, you’ll nearly always want to work with the pre-built classes provided by PyTorch and fastai, because they save you having to think about all the little details yourself.)