Saving and Loading Models

Author: PaddlePaddle

Date: 2021.01

Abstract: This tutorial explains how to save and load model parameters using the Paddle high-level API.

1. Introduction

During day-to-day training, unexpected events can interrupt the process, deliberately or otherwise, before the model has finished training. Saving the model parameters frequently lets you reload the latest saved state and continue training when something goes wrong; and once a model is fully trained, the saved parameters are what you use for prediction or for deploying the model online. For both situations, Paddle provides APIs to save and load models and to resume training from the last saved state: as long as the training state is saved periodically, there is no need to retrain from scratch. The rest of this tutorial uses a handwritten-digit recognition model to show how Paddle saves and loads a model and resumes training; the explanation of the network structure itself is omitted.

2. Environment Setup

This tutorial is written for Paddle 2.0. If your environment is not on this version, please first install Paddle 2.0 by following the official installation guide.

    import paddle
    import paddle.nn.functional as F
    from paddle.nn import Layer
    from paddle.vision.datasets import MNIST
    from paddle.metric import Accuracy
    from paddle.nn import Conv2D, MaxPool2D, Linear
    from paddle.static import InputSpec
    from paddle.vision.transforms import ToTensor

    print(paddle.__version__)
    2.0.0

3. Dataset

The MNIST dataset of handwritten digits contains 60,000 training examples and 10,000 test examples. The digits have been size-normalized and centered in fixed-size images of 28x28 pixels, and with the ToTensor transform used below the pixel values lie in the range 0 to 1. The official home of the dataset is http://yann.lecun.com/exdb/mnist/. This example uses the MNIST dataset bundled with Paddle; it only needs to be imported with from paddle.vision.datasets import MNIST.

    train_dataset = MNIST(mode='train', transform=ToTensor())
    test_dataset = MNIST(mode='test', transform=ToTensor())
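
As a quick sanity check, you can print the dataset sizes and look at the shape of one transformed sample; this step is optional and not part of the original walkthrough:

    # Dataset sizes and the shape of one ToTensor-transformed sample.
    print(len(train_dataset), len(test_dataset))   # 60000 10000
    img, label = train_dataset[0]
    print(img.shape, label)                        # [1, 28, 28] and the digit label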

4. Building the Model

    class MyModel(Layer):
        def __init__(self):
            super(MyModel, self).__init__()
            self.conv1 = Conv2D(in_channels=1, out_channels=6, kernel_size=5, stride=1, padding=2)
            self.max_pool1 = MaxPool2D(kernel_size=2, stride=2)
            self.conv2 = Conv2D(in_channels=6, out_channels=16, kernel_size=5, stride=1)
            self.max_pool2 = MaxPool2D(kernel_size=2, stride=2)
            self.linear1 = Linear(in_features=16*5*5, out_features=120)
            self.linear2 = Linear(in_features=120, out_features=84)
            self.linear3 = Linear(in_features=84, out_features=10)

        def forward(self, x):
            x = self.conv1(x)
            x = F.relu(x)
            x = self.max_pool1(x)
            x = self.conv2(x)
            x = F.relu(x)
            x = self.max_pool2(x)
            # Flatten the [N, 16, 5, 5] feature maps into [N, 400] vectors for the linear layers.
            x = paddle.flatten(x, start_axis=1, stop_axis=-1)
            x = self.linear1(x)
            x = F.relu(x)
            x = self.linear2(x)
            x = F.relu(x)
            x = self.linear3(x)
            return x
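
To double-check the intermediate shapes, in particular that the flattened feature size feeding linear1 is 16*5*5 = 400, paddle.summary can print a per-layer overview; this quick check is optional and not part of the original walkthrough:

    # Print a per-layer summary for a single 1x28x28 input image.
    paddle.summary(MyModel(), (1, 1, 28, 28))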

5. Model Training

Wrap the network in a paddle.Model instance to complete training quickly with the high-level API.

    # Input images are 1x28x28 float32 tensors; labels are single int64 class indices.
    inputs = InputSpec([None, 1, 28, 28], 'float32', 'x')
    labels = InputSpec([None, 1], 'int64', 'label')
    model = paddle.Model(MyModel(), inputs, labels)
    optim = paddle.optimizer.Adam(learning_rate=0.001, parameters=model.parameters())

    model.prepare(
        optim,
        paddle.nn.CrossEntropyLoss(),
        Accuracy()
    )

    model.fit(train_dataset,
              test_dataset,
              epochs=3,
              batch_size=64,
              save_dir='mnist_checkpoint',
              verbose=1
              )
    The loss value printed in the log is the current step, and the metric is the average value of previous step.
    Epoch 1/3
    step 938/938 [==============================] - loss: 0.0398 - acc: 0.9435 - 20ms/step
    save checkpoint at /Users/tclong/online_repo/book/paddle2.0_docs/save_model/mnist_checkpoint/0
    Eval begin...
    The loss value printed in the log is the current batch, and the metric is the average value of previous step.
    step 157/157 [==============================] - loss: 0.0043 - acc: 0.9782 - 18ms/step
    Eval samples: 10000
    Epoch 2/3
    step 938/938 [==============================] - loss: 0.0340 - acc: 0.9818 - 22ms/step
    save checkpoint at /Users/tclong/online_repo/book/paddle2.0_docs/save_model/mnist_checkpoint/1
    Eval begin...
    The loss value printed in the log is the current batch, and the metric is the average value of previous step.
    step 157/157 [==============================] - loss: 5.2083e-04 - acc: 0.9853 - 19ms/step
    Eval samples: 10000
    Epoch 3/3
    step 938/938 [==============================] - loss: 0.0706 - acc: 0.9868 - 27ms/step
    save checkpoint at /Users/tclong/online_repo/book/paddle2.0_docs/save_model/mnist_checkpoint/2
    Eval begin...
    The loss value printed in the log is the current batch, and the metric is the average value of previous step.
    step 157/157 [==============================] - loss: 5.4219e-04 - acc: 0.9882 - 19ms/step
    Eval samples: 10000
    save checkpoint at /Users/tclong/online_repo/book/paddle2.0_docs/save_model/mnist_checkpoint/final

6. Saving Model Parameters

Paddle currently provides three families of APIs for saving model parameters:

High-level API - saving model parameters
  • paddle.Model.fit
  • paddle.Model.save

Base framework, dynamic graph - saving model parameters
  • paddle.save

Base framework, static graph - saving model parameters
  • paddle.static.save
  • paddle.static.save_inference_model

The rest of this tutorial explains saving and loading with the high-level API.
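
For reference, a minimal sketch of the base-framework dynamic-graph approach, which saves and restores state_dicts directly (the 'dygraph_example' file prefix is only an illustration):

    # Base-framework (dynamic graph) saving and loading, sketched for reference.
    layer = MyModel()
    opt = paddle.optimizer.Adam(learning_rate=0.001, parameters=layer.parameters())

    # Save the parameter and optimizer state_dicts to disk.
    paddle.save(layer.state_dict(), 'dygraph_example.pdparams')
    paddle.save(opt.state_dict(), 'dygraph_example.pdopt')

    # Later: load the files back and restore both states.
    layer.set_state_dict(paddle.load('dygraph_example.pdparams'))
    opt.set_state_dict(paddle.load('dygraph_example.pdopt'))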

Method 1:

  • paddle.Model.fit(train_data, epochs, batch_size, save_dir, save_freq, ...) When training the network with model.fit, pass the checkpoint directory via the save_dir argument and the saving frequency in epochs via save_freq; the model is then saved while it trains. At the end of each saved epoch, a checkpoint named after the epoch index is written under save_dir, consisting of an {epoch}.pdparams file with the model parameters and a matching {epoch}.pdopt file with the optimizer state (a brief save_freq sketch follows the method 1 output below).

Method 2:

  • paddle.Model.save(self, path, training=True) model.save(path) saves the current model state once at the given path. With training=True (the default), which is intended for use during or after training, it writes two files, {path}.pdparams with the model parameters and {path}.pdopt with the optimizer state. The path argument has the form 'dirname/file_prefix' or 'file_prefix', where dirname is the directory and file_prefix names the saved files. With training=False, used once training is finished, it instead exports an inference model containing the network structure and parameters for prediction or deployment (see the export sketch after the method 2 code below).
    # Method 1: save a checkpoint for every epoch while training
    model.fit(train_dataset,
              test_dataset,
              epochs=2,
              batch_size=64,
              save_dir='mnist_checkpoint',
              verbose=1
              )
    The loss value printed in the log is the current step, and the metric is the average value of previous step.
    Epoch 1/2
    step 938/938 [==============================] - loss: 0.0023 - acc: 0.9898 - 21ms/step
    save checkpoint at /Users/tclong/online_repo/book/paddle2.0_docs/save_model/mnist_checkpoint/0
    Eval begin...
    The loss value printed in the log is the current batch, and the metric is the average value of previous step.
    step 157/157 [==============================] - loss: 7.4614e-05 - acc: 0.9869 - 19ms/step
    Eval samples: 10000
    Epoch 2/2
    step 938/938 [==============================] - loss: 0.0014 - acc: 0.9917 - 20ms/step
    save checkpoint at /Users/tclong/online_repo/book/paddle2.0_docs/save_model/mnist_checkpoint/1
    Eval begin...
    The loss value printed in the log is the current batch, and the metric is the average value of previous step.
    step 157/157 [==============================] - loss: 5.2536e-05 - acc: 0.9878 - 18ms/step
    Eval samples: 10000
    save checkpoint at /Users/tclong/online_repo/book/paddle2.0_docs/save_model/mnist_checkpoint/final
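
If a checkpoint for every single epoch is not needed, fit also accepts a save_freq argument that controls how often (in epochs) a checkpoint is written. A brief sketch, not executed in this tutorial:

    # Write a checkpoint under 'mnist_checkpoint' every 2 epochs instead of every epoch.
    model.fit(train_dataset,
              test_dataset,
              epochs=4,
              batch_size=64,
              save_dir='mnist_checkpoint',
              save_freq=2,
              verbose=1)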
    # Method 2: model.save() saves the model parameters and optimizer state
    model.save('mnist_checkpoint/test')
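
Once training is finished, the same method with training=False exports an inference model (network structure plus parameters, no optimizer state), which can then be loaded back with paddle.jit.load for prediction. A minimal sketch, where the 'mnist_inference/test' prefix is just an example path:

    # Export an inference model; this writes the network structure and parameters only.
    model.save('mnist_inference/test', training=False)

    # Load the exported model and switch it to evaluation mode for prediction.
    infer_net = paddle.jit.load('mnist_inference/test')
    infer_net.eval()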

7. Loading Model Parameters

When resuming training, the model state needs to be loaded back: the load functions read the model parameters and the optimizer state from the files saved earlier. If the optimizer does not need to be restored, the optimizer state file can simply be left unused. The corresponding loading APIs are:

High-level API - loading model parameters
  • paddle.Model.load

Base framework, dynamic graph - loading model parameters
  • paddle.load

Base framework, static graph - loading model parameters
  • paddle.static.load
  • paddle.static.load_inference_model

The rest of this section explains loading with the high-level API.

  • model.load(self, path, skip_mismatch=False, reset_optimizer=False) model.load can restore both the model parameters and the optimizer state. The reset_optimizer argument controls whether the optimizer state is restored: with reset_optimizer=True the optimizer is re-initialized and the saved optimizer state is ignored; with reset_optimizer=False (the default) the optimizer state is restored from path.

    # Load the model with the high-level API
    model.load('mnist_checkpoint/test')
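
If only the network weights are needed, for example to fine-tune with a freshly initialized optimizer, the optimizer state can be discarded while loading. A short sketch using the reset_optimizer flag described above:

    # Load only the model parameters; the optimizer state is re-initialized.
    model.load('mnist_checkpoint/test', reset_optimizer=True)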

8. Resuming Training

Ideally, resuming training puts the model back into exactly the state it was in when training was interrupted, so that the gradient updates after resuming follow the same trajectory they would have taken without the interruption. Based on this, we can judge whether the methods above restore training correctly by watching how the loss evolves after resuming: load the model parameters and optimizer state from a saved checkpoint and check that the subsequent training losses match those of an uninterrupted run.

Note:

Resuming training has two key requirements:

  • When saving the model, save both the model parameters and the optimizer state.

  • When restoring, restore both the model parameters and the optimizer state.

    import paddle
    from paddle.vision.datasets import MNIST
    from paddle.vision.transforms import ToTensor
    from paddle.metric import Accuracy
    from paddle.static import InputSpec

    train_dataset = MNIST(mode='train', transform=ToTensor())
    test_dataset = MNIST(mode='test', transform=ToTensor())

    inputs = InputSpec([None, 1, 28, 28], 'float32', 'inputs')
    labels = InputSpec([None, 1], 'int64', 'labels')
    model = paddle.Model(MyModel(), inputs, labels)
    optim = paddle.optimizer.Adam(learning_rate=0.001, parameters=model.parameters())

    # Restore both the model parameters and the optimizer state from the final checkpoint.
    model.load("./mnist_checkpoint/final")

    model.prepare(
        optim,
        paddle.nn.CrossEntropyLoss(),
        Accuracy()
    )
    model.fit(train_data=train_dataset,
              eval_data=test_dataset,
              batch_size=64,
              epochs=2,
              verbose=1
              )
    The loss value printed in the log is the current step, and the metric is the average value of previous step.
    Epoch 1/2
    step 938/938 [==============================] - loss: 0.0118 - acc: 0.9922 - 20ms/step
    Eval begin...
    The loss value printed in the log is the current batch, and the metric is the average value of previous step.
    step 157/157 [==============================] - loss: 2.4631e-05 - acc: 0.9872 - 17ms/step
    Eval samples: 10000
    Epoch 2/2
    step 938/938 [==============================] - loss: 1.2774e-04 - acc: 0.9942 - 19ms/step
    Eval begin...
    The loss value printed in the log is the current batch, and the metric is the average value of previous step.
    step 157/157 [==============================] - loss: 1.3047e-05 - acc: 0.9882 - 18ms/step
    Eval samples: 10000

9. Summary

This concludes the MNIST handwritten-digit recognition walkthrough of saving a model, loading it, and resuming training. Paddle provides many APIs for saving and loading; choose the ones that best fit your needs.