Training callbacks

Various callbacks to customize training behavior

class ShortEpochCallback[source]

ShortEpochCallback(pct=0.01, short_valid=True) :: Callback

Fit just pct of an epoch, then stop

    learn = synth_learner()
    learn.fit(1, cbs=ShortEpochCallback())

epoch  train_loss  valid_loss  time
0                              00:00

    learn = synth_learner()
    learn.fit(1, cbs=ShortEpochCallback(short_valid=False))

epoch  train_loss  valid_loss  time
0                  14.867975   00:00
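
The pct argument controls how much of the epoch runs before stopping. For example, to run roughly 10% of the training batches as a quick smoke test (a minimal sketch; the value is illustrative):

    learn = synth_learner()
    learn.fit(1, cbs=ShortEpochCallback(pct=0.1))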

class GradientAccumulation[source]

GradientAccumulation(n_acc=32) :: Callback

Accumulate gradients before updating weights

When n_acc is so large that the accumulation threshold is never reached during training, the weights are never updated, so the parameters (and therefore the validation loss) don't change at all:

    learn = synth_learner()
    learn.fit(1, lr=0.01, cbs=GradientAccumulation(n_acc=1000))
    # ensure valid_loss didn't change
    assert learn.recorder.values[-1][1] == learn.recorder.values[0][1]

epoch  train_loss  valid_loss  time
0      10.941168   10.280428   00:00
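
A more typical use of GradientAccumulation is to simulate a larger effective batch size than fits in memory: gradients are summed over several batches and the optimizer only steps after enough items (roughly n_acc) have been seen. A minimal sketch with the same synthetic learner (the values here are illustrative, not a recommendation):

    learn = synth_learner()
    learn.fit(1, lr=0.01, cbs=GradientAccumulation(n_acc=64))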

class GradientClip[source]

GradientClip(max_norm:float=1.0, norm_type:float=2.0) :: Callback

Clip norm of gradients

Normally, if we use a learning rate that is too high, our training will diverge. This happens even with mixed precision training, which avoids infinities by using dynamic loss scaling but still diverges:

    fp16 = MixedPrecision()

    set_seed(99)
    learn = synth_learner(lr=1.1, cuda=True)
    learn.fit(3, cbs=fp16)

epoch  train_loss  valid_loss   time
0      38.214169   25.269012    00:00
1      377.146088  890.011780   00:00
2      839.391907  9965.712891  00:00

By adding the GradientClip callback, the norm_type (default: 2) norm of the gradients is clipped to at most max_norm (default: 1) using nn.utils.clip_grad_norm_, which can avoid loss divergence:

    set_seed(99)
    learn = synth_learner(lr=1.1, cuda=True)
    learn.fit(3, cbs=[GradientClip, fp16])

epoch  train_loss  valid_loss  time
0      2.039427    2.372183    00:00
1      1.402424    0.300724    00:00
2      1.013551    0.332668    00:00
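
For context, nn.utils.clip_grad_norm_ is the standard PyTorch call for this; below is a minimal, self-contained sketch of the same idea in a hand-written loop (toy model and data, not the callback's implementation):

    import torch
    import torch.nn as nn

    model = nn.Linear(1, 1)
    opt = torch.optim.SGD(model.parameters(), lr=1.1)
    x, y = torch.randn(16, 1), torch.randn(16, 1)

    for _ in range(10):
        loss = nn.functional.mse_loss(model(x), y)
        loss.backward()
        # rescale gradients so their overall 2-norm is at most max_norm before stepping
        nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0, norm_type=2.0)
        opt.step()
        opt.zero_grad()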

BnFreeze

set_bn_eval[source]

set_bn_eval(m:Module, use_eval=True)

Set bn layers in eval mode for all recursive children of m.
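
A rough sketch of the idea (illustrative, not the fastai source): recursively walk the module's children and put BatchNorm layers in eval mode so their running statistics stop updating during training.

    import torch.nn as nn

    def set_bn_eval_sketch(m: nn.Module, use_eval=True):
        # illustrative stand-in for set_bn_eval, not the library implementation
        bn_types = (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d)
        for child in m.children():
            if isinstance(child, bn_types):
                child.eval() if use_eval else child.train()
            set_bn_eval_sketch(child, use_eval)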

class BnFreeze[source]

BnFreeze(after_create=None, before_fit=None, before_epoch=None, before_train=None, before_batch=None, after_pred=None, after_loss=None, before_backward=None, before_step=None, after_cancel_step=None, after_step=None, after_cancel_batch=None, after_batch=None, after_cancel_train=None, after_train=None, before_validate=None, after_cancel_validate=None, after_validate=None, after_cancel_epoch=None, after_epoch=None, after_cancel_fit=None, after_fit=None) :: Callback

Basic class handling tweaks of the training loop by changing a Learner in various events

BnFreeze is useful when you’d like to train two separate models that have a common feature extractor / body. The only part of the model that’s different is the head that you attach for transfer learning.

Learner.freeze() doesn't suffice here, as the BatchNorm layers are trainable by default and the running mean and std of batches are still tracked. For the feature extractors to fully match, you need to set train_bn=False, and these statistics need to be frozen as well, which is precisely the function of BnFreeze.

    path = untar_data(URLs.MNIST_TINY)
    dls = ImageDataLoaders.from_folder(path, valid_pct=0.2)

We first demonstrate the mismatch of the running stats when using only train_bn=False, by creating a Learner…:

    learn1 = cnn_learner(deepcopy(dls), resnet18, pretrained=True, train_bn=False)

…and grab the first BatchNorm layer, and store a copy of its running mean:

    m = learn1.model[0][1].running_mean.clone()

After fitting, you can see that the running mean has changed:

    learn1.fit(1, lr=0.02)
    test_ne(to_detach(learn1.model[0][1].running_mean), m)

epoch  train_loss  valid_loss  time
0      1.152701    0.468892    00:02

When we use the BnFreeze callback, the running statistics will not be changed during training. This is often important for getting good results from transfer learning.

    learn1 = cnn_learner(deepcopy(dls), resnet18, pretrained=True, train_bn=False, cbs=BnFreeze)
    m = learn1.model[0][1].running_mean.detach().clone()
    learn1.fit(1, lr=0.02)
    test_eq(to_detach(learn1.model[0][1].running_mean), m)

epoch  train_loss  valid_loss  time
0      0.488634    0.277683    00:02
