DeepFM

1. Algorithm Overview

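DeepFM adds deep layers on top of FM (Factorization Machine). Compared with PNN and NFM, it keeps FM's second-order implicit feature crosses while using a deep network to capture higher-order feature crosses. Its architecture is as follows: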

[Figure: DeepFM network architecture]

1.1 The Embedding and BiInnerSumCross Layers

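Unlike a traditional FM implementation, the second-order implicit crosses here are computed by combining an Embedding layer with a BiInnerSumCross layer. The traditional FM second-order cross term is: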

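$$\sum_i\sum_{j>i} x_i x_j \mathbf{v}_i^T\mathbf{v}_j = \frac{1}{2}\Big(\sum_i\sum_j (x_i\mathbf{v}_i)^T(x_j\mathbf{v}_j) - \sum_i (x_i\mathbf{v}_i)^T(x_i\mathbf{v}_i)\Big)$$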

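In the implementation, the vectors $\mathbf{v}_i$ are stored in the Embedding layer. When the Embedding's calOutput is called, the products $x_i\mathbf{v}_i$ are computed and output together. Writing $\mathbf{u}_i = x_i\mathbf{v}_i$, the Embedding output for one sample is: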

$$(\mathbf{u}_1, \mathbf{u}_2, \mathbf{u}_3, \cdots, \mathbf{u}_k)$$

The original second-order cross term can then be rewritten as:

$$\frac{1}{2}\Big(\big(\sum_i\mathbf{u}_i\big)^T\big(\sum_j\mathbf{u}_j\big) - \sum_i\mathbf{u}_i^T\mathbf{u}_i\Big)$$
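As a quick sanity check of this rewriting, here is a small worked example. The three vectors are made up for illustration (k = 3, two factors each) and do not come from any real model:

```latex
\begin{aligned}
\mathbf{u}_1 &= (1,0),\quad \mathbf{u}_2 = (0,2),\quad \mathbf{u}_3 = (1,1) \\
\sum_{i<j}\mathbf{u}_i^T\mathbf{u}_j &= \mathbf{u}_1^T\mathbf{u}_2 + \mathbf{u}_1^T\mathbf{u}_3 + \mathbf{u}_2^T\mathbf{u}_3 = 0 + 1 + 2 = 3 \\
\frac{1}{2}\Big(\big(\sum_i\mathbf{u}_i\big)^T\big(\sum_j\mathbf{u}_j\big) - \sum_i\mathbf{u}_i^T\mathbf{u}_i\Big)
  &= \frac{1}{2}\big((2^2 + 3^2) - (1 + 4 + 2)\big) = \frac{1}{2}(13 - 7) = 3
\end{aligned}
```

Both forms give 3. The second form needs only one pass over the k vectors instead of a pass over all pairs.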

The rewritten cross term is exactly the forward computation of BiInnerSumCross. In Scala it is implemented as:

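    // Excerpt: mat, batchSize and data are defined by the enclosing layer code (not shown)
    // Reusable accumulator for the sum of the field vectors u_i
    val sumVector = VFactory.denseDoubleVector(mat.getSubDim)
    (0 until batchSize).foreach { row =>
      // Each partition of the row is one field vector u_i
      val partitions = mat.getRow(row).getPartitions
      partitions.foreach { vectorOuter =>
        data(row) -= vectorOuter.dot(vectorOuter)  // subtract u_i^T u_i
        sumVector.iadd(vectorOuter)                // accumulate sum_i u_i
      }
      data(row) += sumVector.dot(sumVector)        // add (sum_i u_i)^T (sum_j u_j)
      data(row) /= 2
      sumVector.clear()                            // reset for the next sample
    }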
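The same computation can be sketched without the Angel vector classes, which makes it easy to check against the naive pairwise sum. This is a minimal standalone sketch using plain arrays. The object and method names are hypothetical and are not part of Angel:

```scala
// Standalone sketch (not Angel code): one sample is given as k field
// vectors u_1..u_k, each of the same length (the number of factors).
object BiInnerSumCrossSketch {
  // Dot product of two equally sized vectors.
  def dot(a: Array[Double], b: Array[Double]): Double = {
    require(a.length == b.length, "vectors must have the same length")
    var s = 0.0
    var i = 0
    while (i < a.length) { s += a(i) * b(i); i += 1 }
    s
  }

  // Fast form: 0.5 * ((sum_i u_i)^T (sum_j u_j) - sum_i u_i^T u_i).
  def forward(fields: Array[Array[Double]]): Double = {
    if (fields.isEmpty) 0.0
    else {
      val sum = new Array[Double](fields.head.length)
      var out = 0.0
      fields.foreach { u =>
        require(u.length == sum.length, "all field vectors must have the same length")
        out -= dot(u, u)                                 // subtract u_i^T u_i
        var i = 0
        while (i < u.length) { sum(i) += u(i); i += 1 }  // accumulate sum_i u_i
      }
      (out + dot(sum, sum)) / 2
    }
  }

  // Reference form: explicit sum over all pairs i < j.
  def pairwise(fields: Array[Array[Double]]): Double = {
    var out = 0.0
    for (i <- fields.indices; j <- (i + 1) until fields.length)
      out += dot(fields(i), fields(j))
    out
  }
}
```

For the field vectors (1, 0), (0, 2) and (1, 1), both `forward` and `pairwise` return 3.0. The fast form costs one pass over the k vectors, while the reference form visits every pair.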

1.2 Other Layers

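  • SimpleInputLayer: the sparse data input layer, specially optimized for sparse high-dimensional data; essentially an FCLayer
  • FCLayer: the most common layer in a DNN, a linear transformation followed by a transfer function
  • SumPooling: sums multiple inputs element-wise; the inputs must have the same shape
  • SimpleLossLayer: the loss layer; different loss functions can be specified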
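To make the SumPooling behaviour concrete, here is a minimal sketch of an element-wise sum with a shape check. It uses plain arrays rather than Angel's matrix types, and the object and method names are hypothetical:

```scala
// Standalone sketch (not Angel code) of an element-wise sum over inputs
// that must all have the same shape.
object SumPoolingSketch {
  def sumPooling(inputs: Seq[Array[Double]]): Array[Double] = {
    require(inputs.nonEmpty, "at least one input is required")
    val n = inputs.head.length
    require(inputs.forall(_.length == n), "all inputs must have the same shape")
    val out = new Array[Double](n)
    // Add every input into the output, position by position.
    for (in <- inputs; i <- 0 until n) out(i) += in(i)
    out
  }
}
```

In DeepFM the three inputs would be the outputs of the wide, cross, and deep branches, each with output dimension 1, so the pooled value is a single score per sample.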

1.3 Building the Network

    override def buildNetwork(): Unit = {
      ensureJsonAst()

      // Wide part: sparse linear layer with a single output
      val wide = new SimpleInputLayer("input", 1, new Identity(),
        JsonUtils.getOptimizerByLayerType(jsonAst, "SparseInputLayer")
      )

      // Embedding shared by the FM cross term and the deep part
      val embeddingParams = JsonUtils.getLayerParamsByLayerType(jsonAst, "Embedding")
        .asInstanceOf[EmbeddingParams]
      val embedding = new Embedding("embedding", embeddingParams.outputDim,
        embeddingParams.numFactors, embeddingParams.optimizer.build()
      )

      // FM part: second-order crosses computed from the embedding output
      val innerSumCross = new BiInnerSumCross("innerSumPooling", embedding)

      // Deep part: FC layers stacked on top of the embedding
      val mlpLayer = JsonUtils.getFCLayer(jsonAst, embedding)

      // Sum the three branches element-wise and attach the loss
      val join = new SumPooling("sumPooling", 1, Array[Layer](wide, innerSumCross, mlpLayer))
      new SimpleLossLayer("simpleLossLayer", join, lossFunc)
    }

2. Running and Performance

2.1 JSON Configuration File

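DeepFM has many parameters, so they are specified in a JSON configuration file (for a complete description of the JSON configuration file, see the JSON documentation). A typical example: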

    {
      "data": {
        "format": "dummy",
        "indexrange": 148,
        "numfield": 13,
        "validateratio": 0.1,
        "sampleratio": 0.2
      },
      "model": {
        "modeltype": "T_DOUBLE_SPARSE_LONGKEY",
        "modelsize": 148
      },
      "train": {
        "epoch": 10,
        "numupdateperepoch": 10,
        "lr": 0.5,
        "decayclass": "StandardDecay",
        "decaybeta": 0.01
      },
      "default_optimizer": "Momentum",
      "layers": [
        {
          "name": "wide",
          "type": "simpleinputlayer",
          "outputdim": 1,
          "transfunc": "identity"
        },
        {
          "name": "embedding",
          "type": "embedding",
          "numfactors": 8,
          "outputdim": 104,
          "optimizer": {
            "type": "momentum",
            "momentum": 0.9,
            "reg2": 0.01
          }
        },
        {
          "name": "fclayer",
          "type": "FCLayer",
          "outputdims": [
            100,
            100,
            1
          ],
          "transfuncs": [
            "relu",
            "relu",
            "identity"
          ],
          "inputlayer": "embedding"
        },
        {
          "name": "biinnersumcross",
          "type": "BiInnerSumCross",
          "inputlayer": "embedding",
          "outputdim": 1
        },
        {
          "name": "sumPooling",
          "type": "SumPooling",
          "outputdim": 1,
          "inputlayers": [
            "wide",
            "biinnersumcross",
            "fclayer"
          ]
        },
        {
          "name": "simplelosslayer",
          "type": "simplelosslayer",
          "lossfunc": "logloss",
          "inputlayer": "sumPooling"
        }
      ]
    }
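The dimensions in this example are related to each other, and a small check can catch mismatches before a job is submitted. The sketch below is not part of Angel; the `Dims` case class and the `problems` method are hypothetical. It assumes that the Embedding `outputdim` equals `numfield` × `numfactors` (the example values, 13 × 8 = 104, are consistent with this), that each FC layer has exactly one transfer function, and that the last FC output dimension matches the SumPooling output dimension because SumPooling requires inputs of the same shape:

```scala
// Standalone sketch (not Angel code): sanity checks on the dimensions
// used in a DeepFM configuration.
object DeepFMConfigCheck {
  final case class Dims(
    numField: Int,
    numFactors: Int,
    embeddingOutputDim: Int,
    fcOutputDims: Seq[Int],
    fcTransFuncs: Seq[String],
    sumPoolingOutputDim: Int
  )

  // Returns human-readable problems; an empty list means the checks passed.
  def problems(d: Dims): List[String] = {
    val checks = List(
      // Assumption: one embedding vector of numFactors values per field.
      (d.embeddingOutputDim == d.numField * d.numFactors,
        s"embedding outputdim should be ${d.numField * d.numFactors}"),
      (d.fcOutputDims.length == d.fcTransFuncs.length,
        "outputdims and transfuncs should have the same length"),
      (d.fcOutputDims.lastOption.contains(d.sumPoolingOutputDim),
        "last FC outputdim should match the SumPooling outputdim")
    )
    checks.collect { case (ok, msg) if !ok => msg }
  }

  // Values copied from the example configuration.
  val example: Dims =
    Dims(13, 8, 104, Seq(100, 100, 1), Seq("relu", "relu", "identity"), 1)
}
```

Calling `DeepFMConfigCheck.problems(DeepFMConfigCheck.example)` returns an empty list for the values in the example configuration.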

2.2 Submit Script

    runner="com.tencent.angel.ml.core.graphsubmit.GraphRunner"
    modelClass="com.tencent.angel.ml.classification.DeepFM"

    $ANGEL_HOME/bin/angel-submit \
        --angel.job.name DeepFM \
        --action.type train \
        --angel.app.submit.class $runner \
        --ml.model.class.name $modelClass \
        --angel.train.data.path $input_path \
        --angel.workergroup.number $workerNumber \
        --angel.worker.memory.gb $workerMemory \
        --angel.ps.number $PSNumber \
        --angel.ps.memory.gb $PSMemory \
        --angel.task.data.storage.level $storageLevel \
        --angel.task.memorystorage.max.gb $taskMemory

For deep learning models, prefer specifying the data, training, and network configuration in the JSON file.