Model Training

First, we need to define the training configuration, including whether to use the GPU, the loss function, the optimizer, and the learning rate. In this experiment, since the data is relatively simple, we train on the CPU, use the Adam optimizer with a learning rate of 0.01, and train for 10 epochs in total.

But how should the loss function be designed for a recommendation network? From the CV and NLP chapters we know that classification can use the cross-entropy loss, whose value measures how accurately the algorithm currently classifies. In recommendation, however, there is no single metric that measures how good a recommendation is, is differentiable, and can supervise the training of a neural network. In movie recommendation, the only data that can serve as labels are the ratings, so we use the ratings as the supervision signal, treat the network output as the predicted rating, and train the model with the Mean Square Error (MSE) loss.
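As a minimal sketch of this loss (plain NumPy with made-up ratings, not the model's actual outputs), the mean squared error is simply the average squared difference between the predicted and true ratings:

import numpy as np

# Hypothetical predicted scores and the corresponding true ratings (1-5)
scores_predict = np.array([3.2, 4.8, 1.9, 3.5], dtype='float32')
scores_label = np.array([3.0, 5.0, 2.0, 4.0], dtype='float32')

# Mean squared error: average of the squared differences
mse = np.mean((scores_predict - scores_label) ** 2)
print(mse)  # 0.085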

Note: using the mean squared error loss means training the model as a regression problem. Since movie ratings take only five possible values, could a classification loss be used instead? In fact, ratings are better treated as continuous data: a rating of 3 is close to a rating of 4, but a classification approach would treat them as two unrelated classes and break the continuity between ratings.
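A small numeric illustration of this point (made-up numbers, not the model's outputs): under a squared-error loss, predictions closer to the true rating are penalized less, whereas a classification loss would penalize every wrong class equally:

# True rating is 4; compare the squared error of increasingly distant predictions
true_rating = 4.0
for pred in [3.9, 3.0, 1.5]:
    print(pred, (pred - true_rating) ** 2)   # 0.01, 1.0, 6.25
# With a 5-way classification loss, predicting "3" or "1" for a true rating of "4"
# would be treated as equally wrong, discarding the ordering between ratings.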

The training process itself is much the same as for any other model, so we will not go over it in detail.

import numpy as np
import paddle.fluid as fluid
import paddle.fluid.dygraph as dygraph

def train(model):
    # Training configuration
    use_gpu = False
    lr = 0.01
    Epoches = 10
    place = fluid.CUDAPlace(0) if use_gpu else fluid.CPUPlace()
    with fluid.dygraph.guard(place):
        # Switch the model to training mode
        model.train()
        # Get the data loader
        data_loader = model.train_loader
        # Use the Adam optimizer with a learning rate of 0.01
        opt = fluid.optimizer.Adam(learning_rate=lr, parameter_list=model.parameters())
        for epoch in range(0, Epoches):
            for idx, data in enumerate(data_loader()):
                # Fetch a mini-batch and convert it to dygraph variables
                usr, mov, score = data
                usr_v = [dygraph.to_variable(var) for var in usr]
                mov_v = [dygraph.to_variable(var) for var in mov]
                scores_label = dygraph.to_variable(score)
                # Forward pass
                _, _, scores_predict = model(usr_v, mov_v)
                # Compute the mean squared error loss
                loss = fluid.layers.square_error_cost(scores_predict, scores_label)
                avg_loss = fluid.layers.mean(loss)
                if idx % 500 == 0:
                    print("epoch: {}, batch_id: {}, loss is: {}".format(epoch, idx, avg_loss.numpy()))
                # Backpropagate, update the parameters, and clear the gradients
                avg_loss.backward()
                opt.minimize(avg_loss)
                model.clear_gradients()
            # Save the model once per epoch
            fluid.save_dygraph(model.state_dict(), './checkpoint/epoch' + str(epoch))
# Launch training
with dygraph.guard():
    use_poster, use_mov_title, use_mov_cat, use_age_job = False, True, True, True
    model = Model('Recommend', use_poster, use_mov_title, use_mov_cat, use_age_job)
    train(model)
##Total dataset instances: 1000209
##MovieLens dataset information:
usr num: 6040
movies num: 3883
epoch: 0, batch_id: 0, loss is: [10.873174]
epoch: 0, batch_id: 500, loss is: [0.9738145]
epoch: 0, batch_id: 1000, loss is: [0.7016272]
epoch: 0, batch_id: 1500, loss is: [1.0097994]
epoch: 0, batch_id: 2000, loss is: [0.8981987]
epoch: 0, batch_id: 2500, loss is: [0.8226846]
epoch: 0, batch_id: 3000, loss is: [0.7943625]
epoch: 0, batch_id: 3500, loss is: [0.88057446]
epoch: 1, batch_id: 0, loss is: [0.8270193]
epoch: 1, batch_id: 500, loss is: [0.711991]
epoch: 1, batch_id: 1000, loss is: [0.97378314]
epoch: 1, batch_id: 1500, loss is: [0.8741553]
epoch: 1, batch_id: 2000, loss is: [0.873245]
epoch: 1, batch_id: 2500, loss is: [0.8631375]
epoch: 1, batch_id: 3000, loss is: [0.88147044]
epoch: 1, batch_id: 3500, loss is: [0.9457144]
epoch: 2, batch_id: 0, loss is: [0.7810389]
epoch: 2, batch_id: 500, loss is: [0.9161325]
epoch: 2, batch_id: 1000, loss is: [0.85070896]
epoch: 2, batch_id: 1500, loss is: [0.83222216]
epoch: 2, batch_id: 2000, loss is: [0.82739747]
epoch: 2, batch_id: 2500, loss is: [0.7739769]
epoch: 2, batch_id: 3000, loss is: [0.7288972]
epoch: 2, batch_id: 3500, loss is: [0.71740997]
epoch: 3, batch_id: 0, loss is: [0.7740326]
epoch: 3, batch_id: 500, loss is: [0.79047513]
epoch: 3, batch_id: 1000, loss is: [0.7714803]
epoch: 3, batch_id: 1500, loss is: [0.7388534]
epoch: 3, batch_id: 2000, loss is: [0.8264959]
epoch: 3, batch_id: 2500, loss is: [0.65038306]
epoch: 3, batch_id: 3000, loss is: [0.9168469]
epoch: 3, batch_id: 3500, loss is: [0.8613069]
epoch: 4, batch_id: 0, loss is: [0.7578842]
epoch: 4, batch_id: 500, loss is: [0.89679146]
epoch: 4, batch_id: 1000, loss is: [0.674494]
epoch: 4, batch_id: 1500, loss is: [0.7206632]
epoch: 4, batch_id: 2000, loss is: [0.7801018]
epoch: 4, batch_id: 2500, loss is: [0.8618671]
epoch: 4, batch_id: 3000, loss is: [0.8478118]
epoch: 4, batch_id: 3500, loss is: [1.0286447]
epoch: 5, batch_id: 0, loss is: [0.7023648]
epoch: 5, batch_id: 500, loss is: [0.8227848]
epoch: 5, batch_id: 1000, loss is: [0.88415223]
epoch: 5, batch_id: 1500, loss is: [0.78416216]
epoch: 5, batch_id: 2000, loss is: [0.7939043]
epoch: 5, batch_id: 2500, loss is: [0.7428185]
epoch: 5, batch_id: 3000, loss is: [0.745026]
epoch: 5, batch_id: 3500, loss is: [0.76115835]
epoch: 6, batch_id: 0, loss is: [0.83740556]
epoch: 6, batch_id: 500, loss is: [0.816216]
epoch: 6, batch_id: 1000, loss is: [0.8149048]
epoch: 6, batch_id: 1500, loss is: [0.8676525]
epoch: 6, batch_id: 2000, loss is: [0.88345516]
epoch: 6, batch_id: 2500, loss is: [0.7371645]
epoch: 6, batch_id: 3000, loss is: [0.7923065]
epoch: 6, batch_id: 3500, loss is: [1.0073752]
epoch: 7, batch_id: 0, loss is: [0.8476094]
epoch: 7, batch_id: 500, loss is: [1.0047569]
epoch: 7, batch_id: 1000, loss is: [0.80412626]
epoch: 7, batch_id: 1500, loss is: [0.939283]
epoch: 7, batch_id: 2000, loss is: [0.6579713]
epoch: 7, batch_id: 2500, loss is: [0.7478874]
epoch: 7, batch_id: 3000, loss is: [0.78322697]
epoch: 7, batch_id: 3500, loss is: [0.8548964]
epoch: 8, batch_id: 0, loss is: [0.8920554]
epoch: 8, batch_id: 500, loss is: [0.69566244]
epoch: 8, batch_id: 1000, loss is: [0.94016606]
epoch: 8, batch_id: 1500, loss is: [0.7755744]
epoch: 8, batch_id: 2000, loss is: [0.8520398]
epoch: 8, batch_id: 2500, loss is: [0.77818584]
epoch: 8, batch_id: 3000, loss is: [0.78463334]
epoch: 8, batch_id: 3500, loss is: [0.8538652]
epoch: 9, batch_id: 0, loss is: [0.9502439]
epoch: 9, batch_id: 500, loss is: [0.8200456]
epoch: 9, batch_id: 1000, loss is: [0.8938134]
epoch: 9, batch_id: 1500, loss is: [0.8098132]
epoch: 9, batch_id: 2000, loss is: [0.87928975]
epoch: 9, batch_id: 2500, loss is: [0.7887068]
epoch: 9, batch_id: 3000, loss is: [0.93909657]
epoch: 9, batch_id: 3500, loss is: [0.69399315]

Judging from the training results, the loss plateaus at around 0.9 and is hard to reduce further. This is mainly because we use the mean squared error between the predicted and true ratings: the true ratings are integers between 1 and 5, and with values this large the computed loss is also relatively large.
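As a rough sanity check on that number (a back-of-the-envelope estimate, not something computed in the notebook): a mean squared error of about 0.9 corresponds to a root mean squared error of √0.9 ≈ 0.95, so the predicted rating is typically off by on the order of one star on the 1-5 scale.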

This is not a problem, however: we only train the neural network in order to extract feature vectors, so it is enough for the loss to converge.

To evaluate the trained model on the validation set, besides the loss used during training, there are two other options:

  • Rating prediction accuracy, ACC (Accuracy): round the predicted float value and compare it with the true rating; a prediction whose error is within 0.5 points counts as correct, otherwise as wrong.
  • Rating prediction error, MAE (Mean Absolute Error): the average absolute difference between the predicted and true ratings (see the short sketch after this list).
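A minimal sketch of these two definitions (plain NumPy with made-up scores; note that the evaluation code below aggregates ACC slightly differently):

import numpy as np

pred_scores = np.array([3.2, 4.8, 1.9, 2.9])
true_scores = np.array([3.0, 5.0, 2.0, 4.0])

abs_err = np.abs(pred_scores - true_scores)
acc = np.mean(abs_err <= 0.5)   # fraction of predictions within 0.5 of the true rating
mae = np.mean(abs_err)          # mean absolute error
print(acc, mae)                 # 0.75 0.4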

Below is the code that evaluates these two metrics on the validation set.

def evaluation(model, params_file_path):
    use_gpu = False
    place = fluid.CUDAPlace(0) if use_gpu else fluid.CPUPlace()
    with fluid.dygraph.guard(place):
        # Load the saved parameters and switch to evaluation mode
        model_state_dict, _ = fluid.load_dygraph(params_file_path)
        model.load_dict(model_state_dict)
        model.eval()
        acc_set = []
        avg_loss_set = []
        for idx, data in enumerate(model.valid_loader()):
            usr, mov, score_label = data
            usr_v = [dygraph.to_variable(var) for var in usr]
            mov_v = [dygraph.to_variable(var) for var in mov]
            _, _, scores_predict = model(usr_v, mov_v)
            pred_scores = scores_predict.numpy()
            # MAE: mean absolute error between predicted and true ratings
            avg_loss_set.append(np.mean(np.abs(pred_scores - score_label)))
            # ACC: errors larger than 0.5 count fully as wrong (set to 1) before averaging
            diff = np.abs(pred_scores - score_label)
            diff[diff > 0.5] = 1
            acc = 1 - np.mean(diff)
            acc_set.append(acc)
        return np.mean(acc_set), np.mean(avg_loss_set)
param_path = "./checkpoint/epoch"
for i in range(10):
    acc, mae = evaluation(model, param_path + str(i))
    print("ACC:", acc, "MAE:", mae)
ACC: 0.2805188926366659 MAE: 0.7952824
ACC: 0.2852882689390427 MAE: 0.7941532
ACC: 0.2824734888015649 MAE: 0.79572767
ACC: 0.2776615373599224 MAE: 0.80148673
ACC: 0.2799660603205363 MAE: 0.8010404
ACC: 0.2806148324257288 MAE: 0.8026996
ACC: 0.2807383934656779 MAE: 0.80340725
ACC: 0.2749944688417973 MAE: 0.80362296
ACC: 0.280727839240661 MAE: 0.80528593
ACC: 0.2924909143111645 MAE: 0.79743403

In the results above, we used the ACC and MAE metrics to measure how accurately the model predicts ratings on the validation set; a higher ACC is better, and a lower MAE is better.

The ACC and MAE values are clearly not ideal, but this only means the rating predictions are inaccurate; it does not directly measure the quality of the recommendations. Keeping in mind that the network was designed for the recommendation task rather than the rating task, we can summarize:
1. For the rating prediction task alone, our network architecture and loss function are not well suited, which is why the predicted ratings are not ideal.
2. The convergence of the loss shows that the training itself is effective; the quality of the rating predictions does not reflect the quality of the recommendations.

At this point, we have completed the first three steps of the recommendation pipeline: 1. preparing the data, 2. designing the neural network, and 3. training the neural network.

Two steps remain: 1. extracting the user and movie features and saving them locally, and 2. computing a similarity matrix from the saved features and using the similarities to make recommendations.
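As a preview of step 2 (a minimal sketch assuming the saved features end up as plain NumPy arrays; the names usr_feat and mov_feats and the 200-dimensional feature size are hypothetical, not taken from the code above), the similarity can be computed as a cosine similarity between a user feature and every movie feature:

import numpy as np

# Hypothetical stand-ins for the saved features: one user vector and one row per movie
usr_feat = np.random.rand(200).astype('float32')
mov_feats = np.random.rand(3883, 200).astype('float32')

# Cosine similarity between the user feature and every movie feature
sims = mov_feats @ usr_feat / (np.linalg.norm(mov_feats, axis=1) * np.linalg.norm(usr_feat))

# Indices of the 10 most similar movies, which could then be recommended to this user
top10 = np.argsort(sims)[-10:][::-1]
print(top10)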

Next, we will use the trained network to extract the data features, complete the movie recommendation, and see whether the results are satisfactory.