Applying the Gradient Accumulation Algorithm
Linux GPU Model Optimization Intermediate Expert
Overview
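This tutorial describes gradient accumulation, a training method that addresses the case where some large networks cannot be trained with a large batch_size because of insufficient memory.
In conventional training, the loss and gradients are computed for each batch and the parameters are updated with those gradients immediately.
Gradient accumulation instead introduces the notion of a mini-batch. The loss and gradients are computed for each mini-batch, but the model parameters are not updated right away. The gradients are accumulated, and after a specified number (N) of mini-batches the accumulated gradients are used to update the network parameters. The accumulated gradients are then cleared before the next round of accumulation starts, and the cycle repeats.
The goal is to achieve almost the same effect as training directly on a batch of N * mini-batch samples.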
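The claim about matching a batch of N * mini-batch samples can be checked on a toy problem. The following is a minimal sketch in plain Python, not MindSpore code: the model y = w * x, the data, and the helper grad_mse are hypothetical and exist only for illustration. It assumes equally sized mini-batches and a mean-reduced loss, which is also what the training script in this tutorial uses (reduction="mean").

```python
def grad_mse(w, batch):
    """Gradient of the mean squared error of y = w * x with respect to w."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


# Hypothetical toy data: eight (x, y) samples.
data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2),
        (5.0, 9.8), (6.0, 12.1), (7.0, 14.2), (8.0, 15.9)]
w = 0.5
n = 4                 # number of mini-batches to accumulate (N)
mini_batch_size = 2

# Gradient computed in one shot on the full N * mini-batch batch.
full_grad = grad_mse(w, data)

# Gradient accumulated (summed) over N mini-batches.
grad_sum = 0.0
for i in range(n):
    mini_batch = data[i * mini_batch_size:(i + 1) * mini_batch_size]
    grad_sum += grad_mse(w, mini_batch)

# With a mean-reduced loss, the summed gradient is N times the large-batch gradient,
# so dividing by N recovers it (up to floating-point rounding).
print(full_grad, grad_sum / n)
```

Because the accumulated value is a sum rather than an average, it is N times the large-batch gradient. Whether to divide by N or to fold that factor into the learning rate is a design choice; the code in this tutorial passes the summed gradients to the optimizer as they are.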
This tutorial targets GPU. You can download the main training sample code here: https://gitee.com/mindspore/docs/tree/r1.0/tutorials/tutorial_code/gradient_accumulation
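Creating a Gradient Accumulation Model
MNIST is used as the demonstration dataset, and a simple custom model implements gradient accumulation.
Importing the Required Libraries
The following are the common modules and the MindSpore modules and library files that are needed.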
import argparse
import os
from collections.abc import Iterable

import mindspore.nn as nn
from mindspore import ParameterTuple
from mindspore import context
from mindspore.nn import Cell
import mindspore.ops as ops
from mindspore.train.dataset_helper import DatasetHelper
from mindspore.train.serialization import save_checkpoint
from model_zoo.official.cv.lenet.src.dataset import create_dataset
from model_zoo.official.cv.lenet.src.lenet import LeNet5
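Loading the Dataset
The MNIST dataset is loaded through the MnistDataset interface provided by MindSpore's dataset module. This code is imported from dataset.py in the lenet directory of model_zoo.
Defining the Network
LeNet is used as the example here; other networks such as ResNet-50 or BERT can be used as well. This code is imported from lenet.py in the lenet directory of model_zoo.
Defining the Training Model
The training flow is split into three parts: forward and backward training, parameter update, and clearing of the accumulated gradients.
TrainForwardBackward computes the loss and gradients and uses grad_sum to accumulate the gradients. TrainOptim updates the parameters. TrainClear resets the gradient accumulation variable grad_sum to zero.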
_sum_op = ops.MultitypeFuncGraph("grad_sum_op")
_clear_op = ops.MultitypeFuncGraph("clear_op")


@_sum_op.register("Tensor", "Tensor")
def _cumulative_gard(grad_sum, grad):
    """Apply grad sum to cumulative gradient."""
    # Add the new gradient to grad_sum in place.
    add = ops.AssignAdd()
    return add(grad_sum, grad)


@_clear_op.register("Tensor", "Tensor")
def _clear_grad_sum(grad_sum, zero):
    """Apply zero to clear grad_sum."""
    success = True
    success = ops.depend(success, ops.assign(grad_sum, zero))
    return success


class TrainForwardBackward(Cell):
    def __init__(self, network, optimizer, grad_sum, sens=1.0):
        super(TrainForwardBackward, self).__init__(auto_prefix=False)
        self.network = network
        self.network.set_grad()
        self.network.add_flags(defer_inline=True)
        self.weights = ParameterTuple(network.trainable_params())
        self.optimizer = optimizer
        self.grad_sum = grad_sum
        self.grad = ops.GradOperation(get_by_list=True, sens_param=True)
        self.sens = sens
        self.hyper_map = ops.HyperMap()

    def construct(self, *inputs):
        weights = self.weights
        loss = self.network(*inputs)
        sens = ops.Fill()(ops.DType()(loss), ops.Shape()(loss), self.sens)
        grads = self.grad(self.network, weights)(*inputs, sens)
        # Accumulate the gradients into grad_sum and return the loss.
        return ops.depend(loss, self.hyper_map(ops.partial(_sum_op), self.grad_sum, grads))


class TrainOptim(Cell):
    def __init__(self, optimizer, grad_sum):
        super(TrainOptim, self).__init__(auto_prefix=False)
        self.optimizer = optimizer
        self.grad_sum = grad_sum

    def construct(self):
        # Update the parameters with the accumulated gradients.
        return self.optimizer(self.grad_sum)


class TrainClear(Cell):
    def __init__(self, grad_sum, zeros):
        super(TrainClear, self).__init__(auto_prefix=False)
        self.grad_sum = grad_sum
        self.zeros = zeros
        self.hyper_map = ops.HyperMap()

    def construct(self):
        # Reset every entry of grad_sum to zero.
        success = self.hyper_map(ops.partial(_clear_op), self.grad_sum, self.zeros)
        return success
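Defining the Training Process
Each mini-batch goes through a forward and backward pass that computes the loss and gradients, and mini_steps controls how many mini-batches are accumulated before each parameter update. Once that count is reached, the parameters are updated and the gradient accumulation variable is cleared.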
class GradientAccumulation:
    def __init__(self, network, loss_fn, optimizer):
        self._network = network
        self._loss_fn = loss_fn
        self._optimizer = optimizer

        params = self._optimizer.parameters
        # grad_sum holds the accumulated gradients; zeros is used to reset it.
        self._grad_sum = params.clone(prefix="grad_sum", init='zeros')
        self._zeros = params.clone(prefix="zeros", init='zeros')
        self._train_forward_backward = self._build_train_forward_backward_network()
        self._train_optim = self._build_train_optim()
        self._train_clear = self._build_train_clear()

    @staticmethod
    def _transform_callbacks(callbacks):
        """Transform callback to a list."""
        if callbacks is None:
            return []

        if isinstance(callbacks, Iterable):
            return list(callbacks)

        return [callbacks]

    def _build_train_forward_backward_network(self):
        """Build forward and backward network"""
        network = self._network
        network = nn.WithLossCell(network, self._loss_fn)
        loss_scale = 1.0
        network = TrainForwardBackward(network, self._optimizer, self._grad_sum, loss_scale).set_train()
        return network

    def _build_train_optim(self):
        """Build optimizer network"""
        network = TrainOptim(self._optimizer, self._grad_sum).set_train()
        return network

    def _build_train_clear(self):
        """Build clear network"""
        network = TrainClear(self._grad_sum, self._zeros).set_train()
        return network

    def train_process(self, epoch, train_dataset, mini_steps=None):
        """Training process. The data would be passed to network directly."""
        # Note: mini_steps must be passed as a positive integer; the default None is not usable below.
        dataset_helper = DatasetHelper(train_dataset, dataset_sink_mode=False, epoch_num=epoch)

        for i in range(epoch):
            step = 0
            for k, next_element in enumerate(dataset_helper):
                loss = self._train_forward_backward(*next_element)
                # Update and clear once every mini_steps mini-batches.
                if (k + 1) % mini_steps == 0:
                    step += 1
                    print("epoch:", i + 1, "step:", step, "loss is ", loss)
                    self._train_optim()
                    self._train_clear()

            train_dataset.reset()

        save_checkpoint(self._train_forward_backward, "gradient_accumulation.ckpt")
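The control flow of train_process can be isolated from MindSpore to make the role of mini_steps easier to see. The following is a minimal sketch in plain Python under stated assumptions: ToyAccumulator, its scalar weight, the squared-error loss, and the plain SGD step are all hypothetical stand-ins for the three cells and the Momentum optimizer, not the tutorial's implementation.

```python
class ToyAccumulator:
    """Hypothetical stand-in for the three training cells, for one scalar weight."""

    def __init__(self, weight, lr):
        self.weight = weight
        self.lr = lr
        self.grad_sum = 0.0

    def forward_backward(self, x, y):
        # Compute the loss (w*x - y)^2 and add its gradient to grad_sum.
        err = self.weight * x - y
        self.grad_sum += 2.0 * err * x
        return err * err

    def optim(self):
        # Plain SGD step using the accumulated gradient (assumption: no momentum).
        self.weight -= self.lr * self.grad_sum

    def clear(self):
        # Reset the accumulated gradient before the next round.
        self.grad_sum = 0.0


def train(model, batches, mini_steps):
    """Run one pass over batches and return the number of parameter updates."""
    updates = 0
    for k, (x, y) in enumerate(batches):
        model.forward_backward(x, y)
        if (k + 1) % mini_steps == 0:
            model.optim()
            model.clear()
            updates += 1
    return updates


model = ToyAccumulator(weight=0.0, lr=0.01)
batches = [(1.0, 2.0)] * 10          # ten identical toy mini-batches
updates = train(model, batches, mini_steps=4)
print(updates, model.grad_sum)
```

With ten mini-batches and mini_steps=4, the loop performs two updates and the gradients of the last two mini-batches stay in grad_sum. The same pattern appears to apply to the loop above whenever the number of mini-batches per epoch is not a multiple of mini_steps, since grad_sum is only cleared after an update.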
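Training and Saving the Model
Instantiate the network, optimizer, and loss function, then call the train_process interface of the custom GradientAccumulation class to train the model.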
if __name__ == "__main__":
    parser = argparse.ArgumentParser(description='MindSpore Grad Cumulative Example')
    parser.add_argument('--device_target', type=str, default="GPU", choices=['GPU'],
                        help='device where the code will be implemented (default: GPU)')
    parser.add_argument('--data_path', type=str, default="./Data",
                        help='path where the dataset is saved')
    args = parser.parse_args()

    context.set_context(mode=context.GRAPH_MODE, device_target=args.device_target)
    # Mini-batch size of 32.
    ds_train = create_dataset(os.path.join(args.data_path, "train"), 32)

    network = LeNet5(10)
    net_loss = nn.SoftmaxCrossEntropyWithLogits(sparse=True, reduction="mean")
    net_opt = nn.Momentum(network.trainable_params(), 0.01, 0.9)
    model = GradientAccumulation(network, net_loss, net_opt)

    print("============== Starting Training ==============")
    # Train for 10 epochs, updating the parameters once every 4 mini-batches.
    model.train_process(10, ds_train, mini_steps=4)
Experiment Results
After 10 epochs, the accuracy on the test set is about 96.31%.
Running the Training
Run the training code and check the results.
python train.py --data_path=./MNIST_Data
The output is as follows. The loss value gradually decreases as training proceeds:
epoch: 1 step: 27 loss is 0.3660637
epoch: 1 step: 28 loss is 0.25238192
...
epoch: 3 step: 2 loss is 0.12296932
epoch: 3 step: 3 loss is 0.15799297
...
epoch: 10 step: 448 loss is 0.06443884
epoch: 10 step: 449 loss is 0.0067842817
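The step counter in this log counts parameter updates, not mini-batches. A rough sketch of the arithmetic follows, assuming the standard MNIST training split of 60,000 images is used without repetition or filtering (an assumption about create_dataset, which is not shown here); the batch size of 32 and mini_steps of 4 come from the training script.

```python
train_images = 60000   # standard MNIST training split (assumed unmodified)
batch_size = 32        # mini-batch size passed to create_dataset
mini_steps = 4         # mini-batches accumulated per parameter update

mini_batches_per_epoch = train_images // batch_size
updates_per_epoch = mini_batches_per_epoch // mini_steps
leftover_mini_batches = mini_batches_per_epoch % mini_steps
effective_batch_size = batch_size * mini_steps

print(mini_batches_per_epoch, updates_per_epoch,
      leftover_mini_batches, effective_batch_size)
# prints: 1875 468 3 128
```

Under these assumptions each epoch has at most 468 printed steps, which is in line with step values such as 448 and 449 in the log, and each update corresponds to 128 samples.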
Check the saved checkpoint file.
The checkpoint file gradient_accumulation.ckpt, which is the model file, is saved during training.
Validating the Model
Use eval.py in the lenet directory of model_zoo to load the validation dataset and validate the model with the saved checkpoint file.
python eval.py --data_path=./MNIST_Data --ckpt_path=./gradient_accumulation.ckpt --device_target=GPU
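The output is as follows. On the validation dataset the accuracy is about 96.31%, which is consistent with the validation result obtained with a batch_size of 32.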
============== Starting Testing ==============
============== {'Accuracy': 0.9631730769230769} ==============
