Adam

Adam mixes the ideas of SGD with momentum and RMSProp together: it uses the moving average of the gradients as a direction and divides by the square root of the moving average of the squared gradients to give each parameter its own adaptive learning rate.

There is one other difference in how Adam calculates moving averages. It takes the unbiased moving average, which is:

  1. w.avg = beta * w.avg + (1-beta) * w.grad
  2. unbias_avg = w.avg / (1 - (beta**(i+1)))

if we are on the i-th iteration (starting at 0, as Python does). Because w.avg is initialized to zero, the plain moving average is biased toward zero during the first iterations; dividing by 1 - (beta**(i+1)) corrects for this, so the unbiased average looks like the actual gradients from the very beginning (and since beta < 1, the denominator quickly approaches 1, making the correction vanish later on). The short check below illustrates this.
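To see the correction at work, here is a minimal sketch in plain Python (the constant gradient of 2.0 is made up for illustration). The raw average creeps up from zero, while the unbiased version recovers the true gradient from the very first step:

    beta = 0.9
    grad = 2.0   # pretend the gradient is constant at 2.0
    avg = 0.0    # the moving average starts at zero

    for i in range(5):
        avg = beta * avg + (1 - beta) * grad
        unbias_avg = avg / (1 - beta**(i + 1))
        print(f"step {i}: avg={avg:.3f}  unbias_avg={unbias_avg:.3f}")

With a constant gradient, unbias_avg is exactly 2.0 at every step, while avg alone only reaches 0.2 on the first step.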

Putting everything together, our update step looks like:

  1. w.avg = beta1 * w.avg + (1-beta1) * w.grad
  2. unbias_avg = w.avg / (1 - (beta1**(i+1)))
  3. w.sqr_avg = beta2 * w.sqr_avg + (1-beta2) * (w.grad ** 2)
  4. new_w = w - lr * unbias_avg / sqrt(w.sqr_avg + eps)

As with RMSProp, eps is usually set to 1e-8, and the default for (beta1,beta2) suggested by the literature is (0.9,0.999). A minimal code sketch of this update step follows.
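Here is that update step written out in plain PyTorch, following the four lines above (the names adam_step, p, and state are ours for illustration, not fastai's; note that the original Adam paper also debiases w.sqr_avg, a step this simplified version omits):

    import torch

    def adam_step(p, state, lr, beta1=0.9, beta2=0.999, eps=1e-8, i=0):
        # 1. moving average of the gradients
        state['avg'] = beta1 * state['avg'] + (1 - beta1) * p.grad
        # 2. debias it (the paper also debiases sqr_avg; we follow the
        #    simplified version above and skip that)
        unbias_avg = state['avg'] / (1 - beta1**(i + 1))
        # 3. moving average of the squared gradients
        state['sqr_avg'] = beta2 * state['sqr_avg'] + (1 - beta2) * p.grad**2
        # 4. the update
        p.data -= lr * unbias_avg / (state['sqr_avg'] + eps).sqrt()

    # quick smoke test with a made-up parameter and gradient
    p = torch.randn(3, requires_grad=True)
    p.grad = torch.randn(3)   # pretend backward() has run
    state = {'avg': torch.zeros(3), 'sqr_avg': torch.zeros(3)}
    for i in range(3):
        adam_step(p, state, lr=1e-3, i=i)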

In fastai, Adam is the default optimizer we use since it allows faster training, but we’ve found that beta2=0.99 is better suited to the type of schedule we are using. beta1 is the momentum parameter, which we specify with the argument moms in our call to fit_one_cycle. As for eps, fastai uses a default of 1e-5. eps is not just useful for numerical stability: a higher eps limits the maximum value of the adjusted learning rate. To take an extreme example, if eps is 1, then the adjusted learning rate will never be higher than the base learning rate. The small numeric check below shows this cap in action.
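To make that concrete, here is a quick check in plain Python with made-up values: when eps is tiny and the squared-gradient average is small, the effective step size lr / sqrt(sqr_avg + eps) can be far larger than lr, but with eps=1 the denominator is always at least 1, so the adjusted learning rate is capped at lr:

    import math

    lr = 0.01
    for eps in (1e-8, 1e-5, 1.0):
        for sqr_avg in (1e-6, 1e-2, 1.0):
            adjusted = lr / math.sqrt(sqr_avg + eps)
            print(f"eps={eps:g}  sqr_avg={sqr_avg:g}  adjusted lr={adjusted:.4f}")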

Rather than show all the code for this in the book, we’ll let you look at the optimizer notebook in fastai’s GitHub repository (browse the nbs folder and search for the notebook called optimizer). You’ll see all the code we’ve shown so far, along with Adam and other optimizers, and lots of examples and tests.

One thing that changes when we go from SGD to Adam is the way we apply weight decay, and it can have important consequences.