Stochastic Gradient Descent (SGD)

Do you remember the way that Arthur Samuel described machine learning, which we quoted in <>?

: Suppose we arrange for some automatic means of testing the effectiveness of any current weight assignment in terms of actual performance and provide a mechanism for altering the weight assignment so as to maximize the performance. We need not go into the details of such a procedure to see that it could be made entirely automatic and to see that a machine so programmed would “learn” from its experience.

As we discussed, this is the key to allowing us to have a model that can get better and better—that can learn. But our pixel similarity approach does not really do this. We do not have any kind of weight assignment, or any way of improving based on testing the effectiveness of a weight assignment. In other words, we can’t really improve our pixel similarity approach by modifying a set of parameters. In order to take advantage of the power of deep learning, we will first have to represent our task in the way that Arthur Samuel described it.

Instead of trying to find the similarity between an image and an “ideal image,” we could instead look at each individual pixel and come up with a set of weights for each one, such that the highest weights are associated with those pixels most likely to be black for a particular category. For instance, pixels toward the bottom right are not very likely to be activated for a 7, so they should have a low weight for a 7, but they are likely to be activated for an 8, so they should have a high weight for an 8. This can be represented as a function and set of weight values for each possible category—for instance the probability of being the number 8:

    def pr_eight(x,w): return (x*w).sum()

Here we are assuming that x is the image, represented as a vector—in other words, with all of the rows stacked up end to end into a single long line. And we are assuming that the weights are a vector w. If we have this function, then we just need some way to update the weights to make them a little bit better. With such an approach, we can repeat that step a number of times, making the weights better and better, until they are as good as we can make them.
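For instance, if each image is 28×28 pixels, flattening it and applying this kind of function might look like the sketch below (the names img and w here are illustrative, not part of the dataset code):

    import torch

    img = torch.rand(28,28)   # a stand-in for a single 28x28 image
    w = torch.randn(28*28)    # one weight per pixel, randomly initialized
    x = img.view(-1)          # stack the 28 rows end to end into a vector of 784 values
    score = (x*w).sum()       # the same computation as pr_eight(x, w): one number per image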

We want to find the specific values for the vector w that cause the result of our function to be high for those images that are actually 8s, and low for those images that are not. Searching for the best vector w is a way to search for the best function for recognizing 8s. (Because we are not yet using a deep neural network, we are limited by what our function can actually do—we are going to fix that constraint later in this chapter.)

To be more specific, here are the steps that we are going to require, to turn this function into a machine learning classifier:

  1. Initialize the weights.
  2. For each image, use these weights to predict whether it appears to be a 3 or a 7.
  3. Based on these predictions, calculate how good the model is (its loss).
  4. Calculate the gradient, which measures for each weight how changing that weight would change the loss.
  5. Step (that is, change) all the weights based on that calculation.
  6. Go back to step 2, and repeat the process.
  7. Iterate until you decide to stop the training process (for instance, because the model is good enough or you don’t want to wait any longer).

These seven steps, illustrated in <>, are the key to the training of all deep learning models. That deep learning turns out to rely entirely on these steps is extremely surprising and counterintuitive. It’s amazing that this process can solve such complex problems. But, as you’ll see, it really does!

In [ ]:

    #id gradient_descent
    #caption The gradient descent process
    #alt Graph showing the steps for Gradient Descent
    gv('''
    init->predict->loss->gradient->step->stop
    step->predict[label=repeat]
    ''')

Out[ ]:

Diagram: the gradient descent process (init → predict → loss → gradient → step → stop, with step looping back to predict)

There are many different ways to do each of these seven steps, and we will be learning about them throughout the rest of this book. These are the details that make a big difference for deep learning practitioners, but it turns out that the general approach to each one follows some basic principles. Here are a few guidelines:

  • Initialize:: We initialize the parameters to random values. This may sound surprising. There are certainly other choices we could make, such as initializing them to the percentage of times that pixel is activated for that category—but since we already know that we have a routine to improve these weights, it turns out that just starting with random weights works perfectly well.
  • Loss:: This is what Samuel referred to when he spoke of testing the effectiveness of any current weight assignment in terms of actual performance. We need some function that will return a number that is small if the performance of the model is good (the standard approach is to treat a small loss as good, and a large loss as bad, although this is just a convention).
  • Step:: A simple way to figure out whether a weight should be increased a bit or decreased a bit would be just to try it: increase the weight by a small amount, and see if the loss goes up or down. Once you find the correct direction, you could then change that amount by a bit more, and a bit less, until you find an amount that works well. However, this is slow! As we will see, the magic of calculus allows us to directly figure out in which direction, and by roughly how much, to change each weight, without having to try all these small changes (see the sketch after this list). The way to do this is by calculating gradients. This is just a performance optimization; we would get exactly the same results by using the slower manual process.
  • Stop:: Once we’ve decided how many epochs to train the model for (a few suggestions for this were given in the earlier list), we apply that decision. For our digit classifier, we would keep training until the accuracy of the model started getting worse, or we ran out of time.
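To make the “just try it” idea concrete, here is a tiny sketch (illustrative, not part of the notebook) that estimates the slope of x**2 at 3 by nudging the input; the gradient gives us the same answer directly:

    def f(x): return x**2      # pretend this is our loss, and x is a weight

    w, eps = 3.0, 1e-4
    estimate = (f(w + eps) - f(w - eps)) / (2 * eps)   # nudge the weight and measure the effect
    print(estimate)            # about 6.0; calculus gives exactly 2*w = 6 in one step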

Before applying these steps to our image classification problem, let’s illustrate what they look like in a simpler case. First we will define a very simple function, the quadratic—let’s pretend that this is our loss function, and x is a weight parameter of the function:

In [ ]:

    def f(x): return x**2

Here is a graph of that function:

In [ ]:

    plot_function(f, 'x', 'x**2')

Plot of the quadratic function f(x) = x**2

The sequence of steps we described earlier starts by picking some random value for a parameter, and calculating the value of the loss:

In [ ]:

    plot_function(f, 'x', 'x**2')
    plt.scatter(-1.5, f(-1.5), color='red');

Plot of x**2 with a red point marking the randomly chosen starting value, x = -1.5

Now we look to see what would happen if we increased or decreased our parameter by a little bit—the adjustment. This is simply the slope at a particular point:

A graph showing the squared function with the slope at one point

We can change our weight by a little in the direction of the slope, calculate our loss and adjustment again, and repeat this a few times. Eventually, we will get to the lowest point on our curve:

An illustration of gradient descent

This basic idea goes all the way back to Isaac Newton, who pointed out that we can optimize arbitrary functions in this way. Regardless of how complicated our functions become, this basic approach of gradient descent will not significantly change. The only minor changes we will see later in this book are some handy ways we can make it faster, by finding better steps.
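As a minimal sketch of that process (again illustrative, not part of the notebook), we can step a single parameter of f(x) = x**2 by hand, using the fact that its slope at x is 2*x:

    x, lr = -1.5, 0.1                 # start at the red point above, with a small step size
    for i in range(10):
        slope = 2 * x                 # the gradient of x**2
        x -= lr * slope               # move a little in the downhill direction
        print(f"x={x:.3f}  loss={x**2:.3f}")   # the loss shrinks toward the bottom of the curve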

Calculating Gradients

The one magic step is the bit where we calculate the gradients. As we mentioned, we use calculus as a performance optimization; it allows us to more quickly calculate whether our loss will go up or down when we adjust our parameters up or down. In other words, the gradients will tell us how much we have to change each weight to make our model better.

You may remember from your high school calculus class that the derivative of a function tells you how much a change in its parameters will change its result. If not, don’t worry—lots of us forget calculus once high school is behind us! But you will have to have some intuitive understanding of what a derivative is before you continue, so if this is all very fuzzy in your head, head over to Khan Academy and complete the lessons on basic derivatives. You won’t have to know how to calculate them yourself; you just have to know what a derivative is.

The key point about a derivative is this: for any function, such as the quadratic function we saw in the previous section, we can calculate its derivative. The derivative is another function. It calculates the change, rather than the value. For instance, the derivative of the quadratic function at the value 3 tells us how rapidly the function changes at the value 3. More specifically, you may recall that gradient is defined as rise/run, that is, the change in the value of the function, divided by the change in the value of the parameter. When we know how our function will change, then we know what we need to do to make it smaller. This is the key to machine learning: having a way to change the parameters of a function to make it smaller. Calculus provides us with a computational shortcut, the derivative, which lets us directly calculate the gradients of our functions.

One important thing to be aware of is that our function has lots of weights that we need to adjust, so when we calculate the derivative we won’t get back one number, but lots of them—a gradient for every weight. But there is nothing mathematically tricky here; you can calculate the derivative with respect to one weight, and treat all the other ones as constant, then repeat that for each other weight. This is how all of the gradients are calculated, for every weight.

We mentioned just now that you won’t have to calculate any gradients yourself. How can that be? Amazingly enough, PyTorch is able to automatically compute the derivative of nearly any function! What’s more, it does it very fast. Most of the time, it will be at least as fast as any derivative function that you can create by hand. Let’s see an example.

First, let’s pick a tensor value which we want gradients at:

In [ ]:

    xt = tensor(3.).requires_grad_()

Notice the special method requires_grad_? That’s the magical incantation we use to tell PyTorch that we want to calculate gradients with respect to that variable at that value. It is essentially tagging the variable, so PyTorch will remember to keep track of how to compute gradients of the other, direct calculations on it that you will ask for.

a: This API might throw you off if you’re coming from math or physics. In those contexts the “gradient” of a function is just another function (i.e., its derivative), so you might expect gradient-related APIs to give you a new function. But in deep learning, “gradients” usually means the value of a function’s derivative at a particular argument value. The PyTorch API also puts the focus on the argument, not the function you’re actually computing the gradients of. It may feel backwards at first, but it’s just a different perspective.

Now we calculate our function with that value. Notice how PyTorch prints not just the value calculated, but also a note that it has a gradient function it’ll be using to calculate our gradients when needed:

In [ ]:

    yt = f(xt)
    yt

Out[ ]:

    tensor(9., grad_fn=<PowBackward0>)

Finally, we tell PyTorch to calculate the gradients for us:

In [ ]:

    yt.backward()

The “backward” here refers to backpropagation, which is the name given to the process of calculating the derivative of each layer. We’ll see how this is done exactly in chapter <>, when we calculate the gradients of a deep neural net from scratch. This is called the “backward pass” of the network, as opposed to the “forward pass,” which is where the activations are calculated. Life would probably be easier if backward was just called calculate_grad, but deep learning folks really do like to add jargon everywhere they can!

We can now view the gradients by checking the grad attribute of our tensor:

In [ ]:

    xt.grad

Out[ ]:

    tensor(6.)

If you remember your high school calculus rules, the derivative of x**2 is 2*x, and we have x=3, so the gradients should be 2*3=6, which is what PyTorch calculated for us!

Now we’ll repeat the preceding steps, but with a vector argument for our function:

In [ ]:

    xt = tensor([3.,4.,10.]).requires_grad_()
    xt

Out[ ]:

    tensor([ 3., 4., 10.], requires_grad=True)

And we’ll add sum to our function so it can take a vector (i.e., a rank-1 tensor), and return a scalar (i.e., a rank-0 tensor):

In [ ]:

    def f(x): return (x**2).sum()
    yt = f(xt)
    yt

Out[ ]:

    tensor(125., grad_fn=<SumBackward0>)

Our gradients are 2*xt, as we’d expect!

In [ ]:

    yt.backward()
    xt.grad

Out[ ]:

    tensor([ 6., 8., 20.])

The gradients tell us only the slope of our function; they don’t tell us exactly how far to adjust the parameters. But they do give us some idea of how far: if the slope is very large, that may suggest we have more adjustments to do, whereas if the slope is very small, that may suggest we are close to the optimal value.

Stepping With a Learning Rate

Deciding how to change our parameters based on the values of the gradients is an important part of the deep learning process. Nearly all approaches start with the basic idea of multiplying the gradient by some small number, called the learning rate (LR). The learning rate is often a number between 0.001 and 0.1, although it could be anything. Often, people select a learning rate just by trying a few, and finding which results in the best model after training (we’ll show you a better approach later in this book, called the learning rate finder). Once you’ve picked a learning rate, you can adjust your parameters using this simple function:

    w -= gradient(w) * lr   # pseudocode: gradient(w) stands for the gradient of the loss with respect to w

This is known as stepping your parameters, using an optimizer step.
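In PyTorch terms, one such step might look like the following sketch (illustrative only; the end-to-end example later in this section does the same update via the .data attribute instead). The update is wrapped in torch.no_grad() so the step itself isn’t tracked by autograd:

    import torch

    w = torch.randn(3, requires_grad=True)   # some hypothetical parameters
    lr = 0.1
    loss = (w**2).sum()                      # a stand-in loss
    loss.backward()                          # fills in w.grad

    with torch.no_grad():                    # don't track the update itself
        w -= w.grad * lr                     # the optimizer step
        w.grad = None                        # reset the gradient before the next iteration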

If you pick a learning rate that’s too low, it can mean having to do a lot of steps. <> illustrates that.

An illustration of gradient descent with a LR too low

But picking a learning rate that’s too high is even worse—it can actually result in the loss getting worse, as we see in <>!

An illustration of gradient descent with a LR too high

If the learning rate is too high, it may also “bounce” around, rather than actually diverging; <> shows how this has the result of taking many steps to train successfully.

An illustration of gradient descent with a bouncy LR
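To see why, consider f(x) = x**2 again: each step replaces x with x - lr*(2*x), that is, multiplies it by (1 - 2*lr). This quick sketch (illustrative, not from the notebook) compares a bouncy but converging rate with a diverging one:

    for lr in (0.9, 1.1):
        x = 1.0
        steps = []
        for _ in range(6):
            x -= lr * (2 * x)          # the gradient of x**2 is 2*x
            steps.append(round(x, 2))
        print(f"lr={lr}: {steps}")

    # lr=0.9 bounces across zero but the magnitude shrinks, so it still (slowly) converges
    # lr=1.1 bounces and grows on every step: the loss gets worse and worse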

Now let’s apply all of this in an end-to-end example.

An End-to-End SGD Example

We’ve seen how to use gradients to find a minimum. Now it’s time to look at an SGD example and see how finding a minimum can be used to train a model to fit data better.

Let’s start with a simple, synthetic, example model. Imagine you were measuring the speed of a roller coaster as it went over the top of a hump. It would start fast, and then get slower as it went up the hill; it would be slowest at the top, and it would then speed up again as it went downhill. You want to build a model of how the speed changes over time. If you were measuring the speed manually every second for 20 seconds, it might look something like this:

In [ ]:

    time = torch.arange(0,20).float(); time

Out[ ]:

    tensor([ 0., 1., 2., 3., 4., 5., 6., 7., 8., 9., 10., 11., 12., 13., 14., 15., 16., 17., 18., 19.])

In [ ]:

    speed = torch.randn(20)*3 + 0.75*(time-9.5)**2 + 1
    plt.scatter(time,speed);

Scatter plot of the measured speed against time

We’ve added a bit of random noise, since measuring things manually isn’t precise. This means it’s not that easy to answer the question: what was the roller coaster’s speed? Using SGD we can try to find a function that matches our observations. We can’t consider every possible function, so let’s use a guess that it will be quadratic; i.e., a function of the form a*(time**2)+(b*time)+c.

We want to distinguish clearly between the function’s input (the time when we are measuring the coaster’s speed) and its parameters (the values that define which quadratic we’re trying). So, let’s collect the parameters in one argument and thus separate the input, t, and the parameters, params, in the function’s signature:

In [ ]:

    def f(t, params):
        a,b,c = params
        return a*(t**2) + (b*t) + c

In other words, we’ve restricted the problem of finding the best imaginable function that fits the data, to finding the best quadratic function. This greatly simplifies the problem, since every quadratic function is fully defined by the three parameters a, b, and c. Thus, to find the best quadratic function, we only need to find the best values for a, b, and c.

If we can solve this problem for the three parameters of a quadratic function, we’ll be able to apply the same approach for other, more complex functions with more parameters—such as a neural net. Let’s find the parameters for f first, and then we’ll come back and do the same thing for the MNIST dataset with a neural net.

We need to define first what we mean by “best.” We define this precisely by choosing a loss function, which will return a value based on a prediction and a target, where lower values of the function correspond to “better” predictions. For continuous data, it’s common to use mean squared error:

In [ ]:

    def mse(preds, targets): return ((preds-targets)**2).mean()

Now, let’s work through our seven-step process.

Step 1: Initialize the parameters

First, we initialize the parameters to random values, and tell PyTorch that we want to track their gradients, using requires_grad_:

In [ ]:

    params = torch.randn(3).requires_grad_()

In [ ]:

    #hide
    orig_params = params.clone()

Step 2: Calculate the predictions

Next, we calculate the predictions:

In [ ]:

    preds = f(time, params)

Let’s create a little function to see how close our predictions are to our targets, and take a look:

In [ ]:

    def show_preds(preds, ax=None):
        if ax is None: ax=plt.subplots()[1]
        ax.scatter(time, speed)
        ax.scatter(time, to_np(preds), color='red')
        ax.set_ylim(-300,100)

In [ ]:

    show_preds(preds)

Scatter plot of the targets (blue) and the initial, random predictions (red)

This doesn’t look very close—our random parameters suggest that the roller coaster will end up going backwards, since we have negative speeds!

Step 3: Calculate the loss

We calculate the loss as follows:

In [ ]:

    loss = mse(preds, speed)
    loss

Out[ ]:

    tensor(25823.8086, grad_fn=<MeanBackward0>)

Our goal is now to improve this. To do that, we’ll need to know the gradients.

Step 4: Calculate the gradients

The next step is to calculate the gradients. In other words, calculate an approximation of how the parameters need to change:

In [ ]:

    loss.backward()
    params.grad

Out[ ]:

    tensor([-53195.8594, -3419.7146, -253.8908])

In [ ]:

    params.grad * 1e-5

Out[ ]:

    tensor([-0.5320, -0.0342, -0.0025])

We can use these gradients to improve our parameters. We’ll need to pick a learning rate (we’ll discuss how to do that in practice in the next chapter; for now we’ll just use 1e-5, or 0.00001):

In [ ]:

    params

Out[ ]:

    tensor([-0.7658, -0.7506, 1.3525], requires_grad=True)

Step 5: Step the weights

Now we need to update the parameters based on the gradients we just calculated:

In [ ]:

    lr = 1e-5
    params.data -= lr * params.grad.data
    params.grad = None

a: Understanding this bit depends on remembering recent history. To calculate the gradients we call backward on the loss. But this loss was itself calculated by mse, which in turn took preds as an input, which was calculated using f taking as an input params, which was the object on which we originally called requires_grad_—which is the original call that now allows us to call backward on loss. This chain of function calls represents the mathematical composition of functions, which enables PyTorch to use calculus’s chain rule under the hood to calculate these gradients.

Let’s see if the loss has improved:

In [ ]:

    preds = f(time,params)
    mse(preds, speed)

Out[ ]:

    tensor(5435.5366, grad_fn=<MeanBackward0>)

And take a look at the plot:

In [ ]:

    show_preds(preds)

Scatter plot of the targets and the predictions after one step

We need to repeat this a few times, so we’ll create a function to apply one step:

In [ ]:

    def apply_step(params, prn=True):
        preds = f(time, params)
        loss = mse(preds, speed)
        loss.backward()
        params.data -= lr * params.grad.data
        params.grad = None
        if prn: print(loss.item())
        return preds

Step 6: Repeat the process

Now we iterate. By looping and performing many improvements, we hope to reach a good result:

In [ ]:

    for i in range(10): apply_step(params)

    5435.53662109375
    1577.4495849609375
    847.3780517578125
    709.22265625
    683.0757446289062
    678.12451171875
    677.1839599609375
    677.0025024414062
    676.96435546875
    676.9537353515625

In [ ]:

    #hide
    params = orig_params.detach().requires_grad_()

The loss is going down, just as we hoped! But looking only at these loss numbers disguises the fact that each iteration represents an entirely different quadratic function being tried, on the way to finding the best possible quadratic function. We can see this process visually if, instead of printing out the loss function, we plot the function at every step. Then we can see how the shape is approaching the best possible quadratic function for our data:

In [ ]:

    _,axs = plt.subplots(1,4,figsize=(12,3))
    for ax in axs: show_preds(apply_step(params, False), ax)
    plt.tight_layout()

Four plots showing the predictions moving closer to the data after each step

Step 7: Stop

We just decided to stop after 10 epochs arbitrarily. In practice, we would watch the training and validation losses and our metrics to decide when to stop, as we’ve discussed.

Summarizing Gradient Descent

In [ ]:

    #hide_input
    #id gradient_descent
    #caption The gradient descent process
    #alt Graph showing the steps for Gradient Descent
    gv('''
    init->predict->loss->gradient->step->stop
    step->predict[label=repeat]
    ''')

Out[ ]:

Diagram: the gradient descent process (init → predict → loss → gradient → step → stop, with step looping back to predict)

To summarize, at the beginning, the weights of our model can be random (training from scratch) or come from a pretrained model (transfer learning). In the first case, the output we will get from our inputs won’t have anything to do with what we want, and even in the second case, it’s very likely the pretrained model won’t be very good at the specific task we are targeting. So the model will need to learn better weights.

We begin by comparing the outputs the model gives us with our targets (we have labeled data, so we know what result the model should give) using a loss function, which returns a number that we want to make as low as possible by improving our weights. To do this, we take a few data items (such as images) from the training set and feed them to our model. We compare the model’s predictions with the corresponding targets using our loss function, and the score we get tells us how wrong our predictions were. We then change the weights a little bit to make them slightly better.

To find how to change the weights to make the loss a bit better, we use calculus to calculate the gradients. (Actually, we let PyTorch do it for us!) Let’s consider an analogy. Imagine you are lost in the mountains with your car parked at the lowest point. To find your way back to it, you might wander in a random direction, but that probably wouldn’t help much. Since you know your vehicle is at the lowest point, you would be better off going downhill. By always taking a step in the direction of the steepest downward slope, you should eventually arrive at your destination. We use the magnitude of the gradient (i.e., the steepness of the slope) to tell us how big a step to take; specifically, we multiply the gradient by a number we choose called the learning rate to decide on the step size. We then iterate until we have reached the lowest point, which will be our parking lot, then we can stop.
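Putting the whole recipe together, a minimal sketch of the loop we just walked through (reusing the f, mse, time, speed, and lr names from the example above) looks like this:

    params = torch.randn(3).requires_grad_()   # 1. initialize the weights
    for epoch in range(10):                    # 6./7. repeat until we decide to stop
        preds = f(time, params)                # 2. predict
        loss = mse(preds, speed)               # 3. measure how good (or bad) the predictions are
        loss.backward()                        # 4. calculate the gradients
        params.data -= lr * params.grad.data   # 5. step the weights
        params.grad = None                     #    reset the gradients for the next iteration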

All of that we just saw can be transposed directly to the MNIST dataset, except for the loss function. Let’s now see how we can define a good training objective.