RMSProp

RMSProp is another variant of SGD, introduced by Geoffrey Hinton in Lecture 6e of his Coursera class “Neural Networks for Machine Learning”. The main difference from SGD is that it uses an adaptive learning rate: instead of using the same learning rate for every parameter, each parameter gets its own specific learning rate, controlled by a global learning rate. That way we can speed up training by giving a higher learning rate to the weights that need to change a lot, while the ones that are good enough get a lower learning rate.

How do we decide which parameters should have a high learning rate and which should not? We can look at the gradients to get an idea. If a parameter’s gradients have been close to zero for a while, that parameter will need a higher learning rate, because the loss surface is flat in that direction and bigger steps are needed to make progress. On the other hand, if the gradients are all over the place, we should probably be careful and pick a low learning rate to avoid divergence. We can’t just average the gradients to see if they’re varying a lot, because the average of a large positive and a large negative number is close to zero. Instead, we can use the usual trick of taking either the absolute values or the squared values (and then taking the square root of the mean).
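
For instance, a quick check (illustrative only, with made-up numbers) shows how the plain mean hides large gradients that cancel out, while the root mean square does not:

  import torch

  grads = torch.tensor([5., -5., 4., -4.])     # large gradients that keep flipping sign
  print(grads.mean())                          # tensor(0.) -- the mean suggests nothing is happening
  print(grads.square().mean().sqrt())          # tensor(4.5277) -- the RMS reveals the true magnitude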

Once again, to determine the general tendency behind the noise, we will use a moving average—specifically the moving average of the gradients squared. Then we will update the corresponding weight by using the current gradient (for the direction) divided by the square root of this moving average (that way if it’s low, the effective learning rate will be higher, and if it’s high, the effective learning rate will be lower):

  w.square_avg = alpha * w.square_avg + (1-alpha) * (w.grad ** 2)
  new_w = w - lr * w.grad / math.sqrt(w.square_avg + eps)

The eps (epsilon) is added for numerical stability (usually set at 1e-8), and the default value for alpha is usually 0.99.
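
As a rough sketch with made-up numbers (not fastai code), here is that two-line update applied to a single weight that keeps receiving small, steady gradients:

  import math

  alpha, lr, eps = 0.99, 0.01, 1e-8
  w, square_avg = 2.0, 0.0

  for grad in [0.1, 0.1, 0.1]:
      square_avg = alpha * square_avg + (1 - alpha) * grad ** 2
      step = lr * grad / math.sqrt(square_avg + eps)   # effective step, much bigger than lr * grad here
      w -= step
      print(f"square_avg={square_avg:.6f}  step={step:.4f}")

Because the squared average stays small here, the division boosts each step well above the plain lr * grad of 0.001; a parameter with large, erratic gradients would get the opposite treatment.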

We can add this to Optimizer by doing much the same thing we did for avg_grad, but with an extra **2:

In [ ]:

  def average_sqr_grad(p, sqr_mom, sqr_avg=None, **kwargs):
      if sqr_avg is None: sqr_avg = torch.zeros_like(p.grad.data)
      return {'sqr_avg': sqr_mom*sqr_avg + (1-sqr_mom)*p.grad.data**2}

And we can define our step function and optimizer as before:

In [ ]:

  def rms_prop_step(p, lr, sqr_avg, eps, grad_avg=None, **kwargs):
      denom = sqr_avg.sqrt().add_(eps)
      p.data.addcdiv_(p.grad, denom, value=-lr)

  opt_func = partial(Optimizer, cbs=[average_sqr_grad,rms_prop_step],
                     sqr_mom=0.99, eps=1e-7)
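
To see the two callbacks cooperating without a Learner, here is a minimal manual loop; it reuses average_sqr_grad and rms_prop_step from above but bypasses the Optimizer class, and the toy data, learning rate, and iteration count are invented for illustration:

  import torch

  x = torch.randn(100)
  y = 3 * x                                    # target: w should end up near 3
  w = torch.zeros(1, requires_grad=True)

  state = {}                                   # per-parameter state, standing in for what Optimizer tracks
  for _ in range(200):
      loss = ((w * x - y) ** 2).mean()
      loss.backward()
      state.update(average_sqr_grad(w, sqr_mom=0.99, **state))
      rms_prop_step(w, lr=0.1, eps=1e-7, **state)
      w.grad.zero_()

  print(w.item())                              # should have moved close to 3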

Let’s try it out:

In [ ]:

  learn = get_learner(opt_func=opt_func)
  learn.fit_one_cycle(3, 0.003)
epoch  train_loss  valid_loss  accuracy  time
0      2.766912    1.845900    0.402548  00:11
1      2.194586    1.510269    0.504459  00:11
2      1.869099    1.447939    0.544968  00:11

Much better! Now we just have to bring these ideas together, and we have Adam, fastai’s default optimizer.