Improving Training Stability

Since we are so good at recognizing 3s from 7s, let’s move on to something harder—recognizing all 10 digits. That means we’ll need to use MNIST instead of MNIST_SAMPLE:

In [ ]:

path = untar_data(URLs.MNIST)

In [ ]:

#hide
Path.BASE_PATH = path

In [ ]:

path.ls()

Out[ ]:

(#2) [Path('testing'),Path('training')]

The data is in two folders named training and testing, so we have to tell GrandparentSplitter about that (it defaults to train and valid). We do that in the get_dls function, which we define to make it easy to change our batch size later:

In [ ]:

def get_dls(bs=64):
    return DataBlock(
        blocks=(ImageBlock(cls=PILImageBW), CategoryBlock),
        get_items=get_image_files,
        splitter=GrandparentSplitter('training','testing'),
        get_y=parent_label,
        batch_tfms=Normalize()
    ).dataloaders(path, bs=bs)

dls = get_dls()

Remember, it’s always a good idea to look at your data before you use it:

In [ ]:

dls.show_batch(max_n=9, figsize=(4,4))

[Figure: a sample batch of MNIST digit images shown by show_batch]

Now that we have our data ready, we can train a simple model on it.

A Simple Baseline

Earlier in this chapter, we built a model based on a conv function like this:

In [ ]:

def conv(ni, nf, ks=3, act=True):
    res = nn.Conv2d(ni, nf, stride=2, kernel_size=ks, padding=ks//2)
    if act: res = nn.Sequential(res, nn.ReLU())
    return res

Let’s start with a basic CNN as a baseline. We’ll use the same one as earlier, but with one tweak: we’ll use more activations. Since we have more numbers to differentiate, it’s likely we will need to learn more filters.

As we discussed, we generally want to double the number of filters each time we have a stride-2 layer. One way to increase the number of filters throughout our network is to double the number of activations in the first layer–then every layer after that will end up twice as big as in the previous version as well.

But there is a subtle problem with this. Consider the kernel that is being applied to each pixel. By default, we use a 3×3-pixel kernel. That means that there are a total of 3×3 = 9 pixels that the kernel is being applied to at each location. Previously, our first layer had four output filters. That meant that there were four values being computed from nine pixels at each location. Think about what happens if we double this output to eight filters. Then when we apply our kernel we will be using nine pixels to calculate eight numbers. That means it isn’t really learning much at all: the output size is almost the same as the input size. Neural networks will only create useful features if they’re forced to do so—that is, if the number of outputs from an operation is significantly smaller than the number of inputs.
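
To make that arithmetic concrete, here is a throwaway check of the numbers from the previous paragraph (the variable names are just for illustration):

in_channels, ks = 1, 3
inputs_per_location = in_channels * ks * ks        # 3*3*1 = 9 pixel values per kernel application
print(inputs_per_location, 'inputs -> 4 outputs')  # previous first layer: real compression
print(inputs_per_location, 'inputs -> 8 outputs')  # doubled filters: output nearly as big as input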

To fix this, we can use a larger kernel in the first layer. If we use a kernel of 5×5 pixels then there are 25 pixels being used at each kernel application. Creating eight filters from this will mean the neural net will have to find some useful features:

In [ ]:

def simple_cnn():
    return sequential(
        conv(1 ,8, ks=5),        #14x14
        conv(8 ,16),             #7x7
        conv(16,32),             #4x4
        conv(32,64),             #2x2
        conv(64,10, act=False),  #1x1
        Flatten(),
    )

As you’ll see in a moment, we can look inside our models while they’re training in order to try to find ways to make them train better. To do this we use the ActivationStats callback, which records the mean, standard deviation, and histogram of activations of every trainable layer (as we’ve seen, callbacks are used to add behavior to the training loop; we’ll explore how they work in <>):

In [ ]:

from fastai.callback.hook import *

We want to train quickly, so that means training at a high learning rate. Let’s see how we go at 0.06:

In [ ]:

def fit(epochs=1):
    learn = Learner(dls, simple_cnn(), loss_func=F.cross_entropy,
                    metrics=accuracy, cbs=ActivationStats(with_hist=True))
    learn.fit(epochs, 0.06)
    return learn

In [ ]:

learn = fit()

epoch  train_loss  valid_loss  accuracy  time
0      2.307071    2.305865    0.113500  00:16

This didn’t train at all well! Let’s find out why.

One handy feature of the callbacks passed to Learner is that they are made available automatically, with the same name as the callback class, except in snake_case. So, our ActivationStats callback can be accessed through activation_stats. I’m sure you remember learn.recorder… can you guess how that is implemented? That’s right, it’s a callback called Recorder!
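
A quick check in the notebook (assuming the learn object from the cell above) shows both of these attributes:

# Each callback instance is exposed on the Learner under its snake_case class name
learn.activation_stats, learn.recorder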

ActivationStats includes some handy utilities for plotting the activations during training. plot_layer_stats(idx) plots the mean and standard deviation of the activations of layer number idx, along with the percentage of activations near zero. Here’s the first layer’s plot:

In [ ]:

learn.activation_stats.plot_layer_stats(0)

[Figure: mean, standard deviation, and percentage of near-zero activations for the first layer]

Generally our model should have a consistent, or at least smooth, mean and standard deviation of layer activations during training. Activations near zero are particularly problematic, because it means we have computation in the model that’s doing nothing at all (since multiplying by zero gives zero). When you have some zeros in one layer, they will therefore generally carry over to the next layer… which will then create more zeros. Here’s the penultimate layer of our network:

In [ ]:

learn.activation_stats.plot_layer_stats(-2)

[Figure: mean, standard deviation, and percentage of near-zero activations for the penultimate layer]

As expected, the problems get worse towards the end of the network, as the instability and zero activations compound over layers. Let’s look at what we can do to make training more stable.

Increase Batch Size

One way to make training more stable is to increase the batch size. Larger batches have gradients that are more accurate, since they’re calculated from more data. On the downside, though, a larger batch size means fewer batches per epoch, which means fewer opportunities for your model to update weights. Let’s see if a batch size of 512 helps:

In [ ]:

dls = get_dls(512)

In [ ]:

learn = fit()

epoch  train_loss  valid_loss  accuracy  time
0      2.309385    2.302744    0.113500  00:08

Let’s see what the penultimate layer looks like:

In [ ]:

learn.activation_stats.plot_layer_stats(-2)

[Figure: layer stats for the penultimate layer, trained with batch size 512]

Again, we’ve got most of our activations near zero. Let’s see what else we can do to improve training stability.

1cycle Training

Our initial weights are not well suited to the task we’re trying to solve. Therefore, it is dangerous to begin training with a high learning rate: we may very well make the training diverge instantly, as we’ve seen. We probably don’t want to end training with a high learning rate either, so that we don’t skip over a minimum. But we want to train at a high learning rate for the rest of the training period, because we’ll be able to train more quickly that way. Therefore, we should change the learning rate during training, from low, to high, and then back to low again.

Leslie Smith (yes, the same guy that invented the learning rate finder!) developed this idea in his article “Super-Convergence: Very Fast Training of Neural Networks Using Large Learning Rates”. He designed a schedule for learning rate separated into two phases: one where the learning rate grows from the minimum value to the maximum value (warmup), and one where it decreases back to the minimum value (annealing). Smith called this combination of approaches 1cycle training.

1cycle training allows us to use a much higher maximum learning rate than other types of training, which gives two benefits:

  • By training with higher learning rates, we train faster—a phenomenon Smith named super-convergence.
  • By training with higher learning rates, we overfit less because we skip over the sharp local minima to end up in a smoother (and therefore more generalizable) part of the loss.

The second point is an interesting and subtle one; it is based on the observation that a model that generalizes well is one whose loss would not change very much if you changed the input by a small amount. If a model trains at a large learning rate for quite a while, and can find a good loss when doing so, it must have found an area that also generalizes well, because it is jumping around a lot from batch to batch (that is basically the definition of a high learning rate). The problem is that, as we have discussed, just jumping to a high learning rate is more likely to result in diverging losses, rather than seeing your losses improve. So we don’t jump straight to a high learning rate. Instead, we start at a low learning rate, where our losses do not diverge, and we allow the optimizer to gradually find smoother and smoother areas of our parameters by gradually going to higher and higher learning rates.

Then, once we have found a nice smooth area for our parameters, we want to find the very best part of that area, which means we have to bring our learning rates down again. This is why 1cycle training has a gradual learning rate warmup, and a gradual learning rate cooldown. Many researchers have found that in practice this approach leads to more accurate models and trains more quickly. That is why it is the approach that is used by default for fine_tune in fastai.

In <> we’ll learn all about momentum in SGD. Briefly, momentum is a technique where the optimizer takes a step not only in the direction of the gradients, but one that also continues in the direction of previous steps. Leslie Smith introduced the idea of cyclical momentum in “A Disciplined Approach to Neural Network Hyper-Parameters: Part 1”. He suggests that the momentum should vary in the opposite direction of the learning rate: when we are at high learning rates, we use less momentum, and we use more again in the annealing phase.
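
To build intuition for the shape of this schedule, here is a simplified sketch with linear warmup, linear annealing, and momentum moving opposite to the learning rate. The one_cycle helper and its default values are invented for illustration only; fastai’s real schedule (which we’ll plot with plot_sched below) uses cosine annealing and different endpoints.

def one_cycle(pct, lr_max=0.06, mom_max=0.95, mom_min=0.85, pct_start=0.25):
    # pct is the fraction of training completed, between 0 and 1
    lr_min = lr_max / 10
    if pct < pct_start:                          # warmup: lr goes up, momentum goes down
        p = pct / pct_start
        return lr_min + p*(lr_max-lr_min), mom_max - p*(mom_max-mom_min)
    p = (pct - pct_start) / (1 - pct_start)      # annealing: lr goes down, momentum goes up
    return lr_max - p*(lr_max-lr_min), mom_min + p*(mom_max-mom_min)

print(one_cycle(0.0), one_cycle(0.25), one_cycle(1.0))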

We can use 1cycle training in fastai by calling fit_one_cycle:

In [ ]:

def fit(epochs=1, lr=0.06):
    learn = Learner(dls, simple_cnn(), loss_func=F.cross_entropy,
                    metrics=accuracy, cbs=ActivationStats(with_hist=True))
    learn.fit_one_cycle(epochs, lr)
    return learn

In [ ]:

learn = fit()

epoch  train_loss  valid_loss  accuracy  time
0      0.210838    0.084827    0.974300  00:08

We’re finally making some progress! It’s giving us a reasonable accuracy now.

We can view the learning rate and momentum throughout training by calling plot_sched on learn.recorder. learn.recorder (as the name suggests) records everything that happens during training, including losses, metrics, and hyperparameters such as learning rate and momentum:

In [ ]:

learn.recorder.plot_sched()

[Figure: the learning rate and momentum schedules plotted by plot_sched]

Smith’s original 1cycle paper used a linear warmup and linear annealing. As you can see, we adapted the approach in fastai by combining it with another popular approach: cosine annealing. fit_one_cycle provides the following parameters you can adjust:

  • lr_max:: The highest learning rate that will be used (this can also be a list of learning rates for each layer group, or a Python slice object containing the first and last layer group learning rates)
  • div:: How much to divide lr_max by to get the starting learning rate
  • div_final:: How much to divide lr_max by to get the ending learning rate
  • pct_start:: What percentage of the batches to use for the warmup
  • moms:: A tuple (mom1,mom2,mom3) where mom1 is the initial momentum, mom2 is the minimum momentum, and mom3 is the final momentum
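
For reference, a call that spells out all of these arguments might look like the following; the values shown are only for illustration, and we don’t actually run this here:

learn.fit_one_cycle(1, lr_max=0.06, div=25., div_final=1e5,
                    pct_start=0.25, moms=(0.95, 0.85, 0.95))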

Let’s take a look at our layer stats again:

In [ ]:

learn.activation_stats.plot_layer_stats(-2)

[Figure: layer stats for the penultimate layer after 1cycle training]

The percentage of near-zero activations is getting much better, although it’s still quite high.

We can see even more about what’s going on in our training using color_dim, passing it a layer index:

In [ ]:

learn.activation_stats.color_dim(-2)

[Figure: colorful dimension plot of the penultimate layer’s activations]

color_dim was developed by fast.ai in conjunction with a student, Stefano Giomo. Stefano, who refers to the idea as the colorful dimension, provides an in-depth explanation of the history and details behind the method. The basic idea is to create a histogram of the activations of a layer, which we would hope would follow a smooth pattern such as the normal distribution shown in the following figure.

Histogram in 'colorful dimension'

To create color_dim, we take the histogram shown on the left here, and convert it into just the colored representation shown at the bottom. Then we flip it on its side, as shown on the right. We found that the distribution is clearer if we take the log of the histogram values. Then, Stefano describes:

: The final plot for each layer is made by stacking the histogram of the activations from each batch along the horizontal axis. So each vertical slice in the visualisation represents the histogram of activations for a single batch. The color intensity corresponds to the height of the histogram, in other words the number of activations in each histogram bin.

<> shows how this all fits together.

Summary of the colorful dimension

This also illustrates why log(f) is more colorful than f when f follows a normal distribution: taking the log turns the Gaussian curve into a quadratic, which isn’t as narrow.
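
As a rough, hand-rolled sketch of the idea (this is not fastai’s implementation; the colorful_dim helper, the bin range, and the fake activations below are all invented for illustration):

import torch
import matplotlib.pyplot as plt

def colorful_dim(acts_per_batch, bins=40, rng=(-5, 5)):
    # One histogram of activations per batch, stacked along the horizontal axis;
    # the color intensity shows the log of the bin counts.
    hists = [torch.histc(a.float(), bins=bins, min=rng[0], max=rng[1])
             for a in acts_per_batch]
    img = torch.stack(hists, dim=1).log1p()
    plt.imshow(img.numpy(), origin='lower', aspect='auto')
    plt.xlabel('batch'); plt.ylabel('activation histogram bin')

# Fake activations that start near zero and gradually spread out:
fake = [torch.randn(10_000) * (0.05 + i/100) for i in range(200)]
colorful_dim(fake)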

So with that in mind, let’s take another look at the result for the penultimate layer:

In [ ]:

learn.activation_stats.color_dim(-2)

[Figure: colorful dimension plot of the penultimate layer’s activations]

This shows a classic picture of “bad training.” We start with nearly all activations at zero—that’s what we see at the far left, with all the dark blue. The bright yellow at the bottom represents the near-zero activations. Then, over the first few batches we see the number of nonzero activations exponentially increasing. But it goes too far, and collapses! We see the dark blue return, and the bottom becomes bright yellow again. It almost looks like training restarts from scratch. Then we see the activations increase again, and collapse again. After repeating this a few times, eventually we see a spread of activations throughout the range.

It’s much better if training can be smooth from the start. The cycles of exponential increase and then collapse tend to result in a lot of near-zero activations, resulting in slow training and poor final results. One way to solve this problem is to use batch normalization.

Batch Normalization

To fix the slow training and poor final results we ended up with in the previous section, we need to fix the initial large percentage of near-zero activations, and then try to maintain a good distribution of activations throughout training.

Sergey Ioffe and Christian Szegedy presented a solution to this problem in the 2015 paper “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift”. In the abstract, they describe just the problem that we’ve seen:

: Training Deep Neural Networks is complicated by the fact that the distribution of each layer’s inputs changes during training, as the parameters of the previous layers change. This slows down the training by requiring lower learning rates and careful parameter initialization… We refer to this phenomenon as internal covariate shift, and address the problem by normalizing layer inputs.

Their solution, they say, is:

: Making normalization a part of the model architecture and performing the normalization for each training mini-batch. Batch Normalization allows us to use much higher learning rates and be less careful about initialization.

The paper caused great excitement as soon as it was released, because it included the chart in <>, which clearly demonstrated that batch normalization could train a model that was even more accurate than the current state of the art (the Inception architecture) and around 5x faster.

Impact of batch normalization

Batch normalization (often just called batchnorm) works by taking an average of the mean and standard deviations of the activations of a layer and using those to normalize the activations. However, this can cause problems because the network might want some activations to be really high in order to make accurate predictions. So they also added two learnable parameters (meaning they will be updated in the SGD step), usually called gamma and beta. After normalizing the activations to get some new activation vector y, a batchnorm layer returns gamma*y + beta.

That’s why our activations can have any mean or variance, independent from the mean and standard deviation of the results of the previous layer. Those statistics are learned separately, making training easier on our model. The behavior is different during training and validation: during training, we use the mean and standard deviation of the batch to normalize the data, while during validation we instead use a running mean of the statistics calculated during training.
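
Here is a minimal sketch of that computation for a convolutional layer, covering only the training-time behavior described above. The batchnorm_forward name is invented, the running statistics used at validation time are left out, and in practice you would simply use nn.BatchNorm2d, as we do next:

import torch

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    # x: activations of shape (batch, channels, height, width)
    mean = x.mean(dim=(0, 2, 3), keepdim=True)                   # per-channel mean
    var  = x.var(dim=(0, 2, 3), keepdim=True, unbiased=False)    # per-channel variance
    y = (x - mean) / (var + eps).sqrt()                          # normalize the activations
    return gamma.view(1, -1, 1, 1) * y + beta.view(1, -1, 1, 1)  # learnable scale and shift

x = torch.randn(64, 8, 14, 14)
gamma, beta = torch.ones(8), torch.zeros(8)  # learned by SGD in a real batchnorm layer
out = batchnorm_forward(x, gamma, beta)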

Let’s add a batchnorm layer to conv:

In [ ]:

def conv(ni, nf, ks=3, act=True):
    layers = [nn.Conv2d(ni, nf, stride=2, kernel_size=ks, padding=ks//2)]
    if act: layers.append(nn.ReLU())
    layers.append(nn.BatchNorm2d(nf))
    return nn.Sequential(*layers)

and fit our model:

In [ ]:

learn = fit()

epoch  train_loss  valid_loss  accuracy  time
0      0.130036    0.055021    0.986400  00:10

That’s a great result! Let’s take a look at color_dim:

In [ ]:

learn.activation_stats.color_dim(-4)

[Figure: colorful dimension plot for a layer of the batchnorm model]

This is just what we hope to see: a smooth development of activations, with no “crashes.” Batchnorm has really delivered on its promise here! In fact, batchnorm has been so successful that we see it (or something very similar) in nearly all modern neural networks.

An interesting observation about models containing batch normalization layers is that they tend to generalize better than models that don’t contain them. Although we haven’t as yet seen a rigorous analysis of what’s going on here, most researchers believe that the reason for this is that batch normalization adds some extra randomness to the training process. Each mini-batch will have a somewhat different mean and standard deviation than other mini-batches. Therefore, the activations will be normalized by different values each time. In order for the model to make accurate predictions, it will have to learn to become robust to these variations. In general, adding additional randomization to the training process often helps.

Since things are going so well, let’s train for a few more epochs and see how it goes. In fact, let’s increase the learning rate, since the abstract of the batchnorm paper claimed we should be able to “train at much higher learning rates”:

In [ ]:

learn = fit(5, lr=0.1)

epoch  train_loss  valid_loss  accuracy  time
0      0.191731    0.121738    0.960900  00:11
1      0.083739    0.055808    0.981800  00:10
2      0.053161    0.044485    0.987100  00:10
3      0.034433    0.030233    0.990200  00:10
4      0.017646    0.025407    0.991200  00:10

At this point, I think it’s fair to say we know how to recognize digits! It’s time to move on to something harder…