NFM

1. Algorithm Introduction

NFM (Neural Factorization Machines) builds on an Embedding layer: the embedding vectors are multiplied element-wise pair by pair and the products are summed, yielding a single vector with the same dimension as the embeddings, which is then fed into a DNN to extract higher-order feature interactions. Note that NFM does not discard the first-order features; the final prediction combines the first-order features with the higher-order interactions. Its architecture is shown below:

[Figure: NFM network architecture]
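Written out, the prediction combines a linear (wide) term with a DNN applied to the pooled pairwise interactions. The sketch below follows the formulation of the original NFM paper, using the $\bold{u}_i$ notation defined in section 1.1; it summarizes the architecture rather than quoting this project's code:

$$
\hat{y}(\bold{x}) = w_0 + \sum_{i} w_i x_i + \mathrm{DNN}\big(f_{BI}(\bold{u}_1,\cdots,\bold{u}_k)\big),
$$

where $f_{BI}$ is the BiInteractionCross pooling described in section 1.1, and the DNN ends in a single output that is added to the wide term (the SumPooling layer in section 1.3).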

1.1 The BiInteractionCross Layer

In the implementation, the embedding vectors $\bold{v}_i$ are stored in an Embedding layer. After the Embedding's calOutput is called, the products $\bold{u}_i = x_i\bold{v}_i$ are computed and output together, so the Embedding output for one sample is:

$$
(\bold{u}_1, \bold{u}_2, \bold{u}_3, \cdots, \bold{u}_k)
$$

BiInteractionCross is computed as:

$$
\begin{array}{ll}
f_{BI}(\bold{u}_1,\cdots,\bold{u}_k) &= \sum_i\sum_{j>i}\bold{u}_i\otimes\bold{u}_j \\
&= \frac{1}{2}\left[\left(\sum_i\bold{u}_i\right)\otimes\left(\sum_j\bold{u}_j\right)-\sum_i\bold{u}_i\otimes\bold{u}_i\right] \\
&= \frac{1}{2}\left[\left(\sum_i\bold{u}_i\right)^2-\sum_i\bold{u}_i^2\right]
\end{array}
$$

where $\otimes$ denotes the element-wise product and the squares are taken element-wise.

In Scala this is implemented as:

```scala
// sum1Vector accumulates sum_i u_i; sum2Vector accumulates sum_i u_i^2 (element-wise)
val sum1Vector = VFactory.denseDoubleVector(outputDim)
val sum2Vector = VFactory.denseDoubleVector(outputDim)
(0 until batchSize).foreach { row =>
  mat.getRow(row).getPartitions.foreach { vectorOuter =>
    sum1Vector.iadd(vectorOuter)
    sum2Vector.iadd(vectorOuter.mul(vectorOuter))
  }
  // 0.5 * [(sum_i u_i)^2 - sum_i u_i^2], computed in place
  blasMat.setRow(row, sum1Vector.imul(sum1Vector).isub(sum2Vector).imul(0.5))
  sum1Vector.clear()
  sum2Vector.clear()
}
```
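The last line of the formula is what makes BiInteractionCross cheap: instead of O(k²) pairwise products, it only needs two O(k) accumulations per sample. The identity can be checked on toy vectors with the standalone sketch below (plain Scala arrays, no Angel dependencies; all names are illustrative only):

```scala
// Standalone sanity check of the Bi-Interaction identity (toy data, not Angel code).
object BiInteractionCheck extends App {
  val u = Array(Array(1.0, 2.0), Array(3.0, -1.0), Array(0.5, 4.0)) // three embedding vectors
  val dim = u.head.length

  // Direct form: sum over pairs i < j of the element-wise product u_i * u_j
  val direct = Array.fill(dim)(0.0)
  for (i <- u.indices; j <- i + 1 until u.length; d <- 0 until dim)
    direct(d) += u(i)(d) * u(j)(d)

  // Identity form: 0.5 * [(sum_i u_i)^2 - sum_i u_i^2], element-wise
  val sum1 = Array.fill(dim)(0.0)
  val sum2 = Array.fill(dim)(0.0)
  for (v <- u; d <- 0 until dim) { sum1(d) += v(d); sum2(d) += v(d) * v(d) }
  val viaIdentity = (0 until dim).map(d => 0.5 * (sum1(d) * sum1(d) - sum2(d)))

  println(direct.mkString(", "))      // 5.0, 2.0
  println(viaIdentity.mkString(", ")) // 5.0, 2.0
}
```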

1.2 Other Layers

  • SparseInputLayer: sparse input layer, specially optimized for sparse, high-dimensional data; essentially an FCLayer
  • Embedding: embedding layer; if a feature is not one-hot, the embedding vector is multiplied by the feature value
  • FCLayer: the most common layer in a DNN, a linear transformation followed by a transfer function
  • SumPooling: element-wise sum of multiple inputs, which are required to have the same shape
  • SimpleLossLayer: loss layer; different loss functions can be specified

1.3 Network Construction

```scala
override def buildNetwork(): Unit = {
  val wide = new SparseInputLayer("input", 1, new Identity(),
    JsonUtils.getOptimizerByLayerType(jsonAst, "SparseInputLayer"))

  val embeddingParams = JsonUtils.getLayerParamsByLayerType(jsonAst, "Embedding")
    .asInstanceOf[EmbeddingParams]
  val embedding = new Embedding("embedding", embeddingParams.outputDim, embeddingParams.numFactors,
    embeddingParams.optimizer.build()
  )

  val interactionCross = new BiInteractionCross("BiInteractionCross", embeddingParams.numFactors, embedding)
  val hiddenLayer = JsonUtils.getFCLayer(jsonAst, interactionCross)

  val join = new SumPooling("sumPooling", 1, Array[Layer](wide, hiddenLayer))

  new SimpleLossLayer("simpleLossLayer", join, lossFunc)
}
```
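The layer graph mirrors the architecture in section 1: the wide branch (SparseInputLayer over the first-order features) and the DNN branch built on top of BiInteractionCross are merged by SumPooling into a single score, on which SimpleLossLayer computes the loss. Because SumPooling adds its inputs element-wise, the last FC layer must produce the same shape as the wide output, i.e. a single dimension.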

2. Running and Performance

2.1 Json Configuration File

NFM has many parameters, so it is configured with a Json file (see the Json configuration documentation for a complete description). A typical example:

```json
{
  "data": {
    "format": "dummy",
    "indexrange": 148,
    "numfield": 13,
    "validateratio": 0.1
  },
  "model": {
    "modeltype": "T_FLOAT_SPARSE_LONGKEY",
    "modelsize": 148
  },
  "train": {
    "epoch": 10,
    "numupdateperepoch": 10,
    "lr": 0.01,
    "decay": 0.1
  },
  "default_optimizer": "Momentum",
  "layers": [
    {
      "name": "wide",
      "type": "sparseinputlayer",
      "outputdim": 1,
      "transfunc": "identity"
    },
    {
      "name": "embedding",
      "type": "embedding",
      "numfactors": 8,
      "outputdim": 104,
      "optimizer": {
        "type": "momentum",
        "momentum": 0.9,
        "reg2": 0.01
      }
    },
    {
      "name": "biinteractioncross",
      "type": "BiInteractionCross",
      "outputdim": 8,
      "inputlayer": "embedding"
    },
    {
      "name": "fclayer",
      "type": "FCLayer",
      "outputdims": [
        50,
        50,
        1
      ],
      "transfuncs": [
        "relu",
        "relu",
        "identity"
      ],
      "inputlayer": "biinteractioncross"
    },
    {
      "name": "sumPooling",
      "type": "SumPooling",
      "outputdim": 1,
      "inputlayers": [
        "wide",
        "fclayer"
      ]
    },
    {
      "name": "simplelosslayer",
      "type": "simplelosslayer",
      "lossfunc": "logloss",
      "inputlayer": "sumPooling"
    }
  ]
}
```
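A few relationships in this example are worth noting: the Embedding outputdim (104) equals numfield × numfactors (13 × 8), the BiInteractionCross outputdim equals numfactors (8), and the last FCLayer output dimension is 1 with an identity transfer function so that its output can be summed with the wide layer by SumPooling.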

2.2 Submit Script

```shell
runner="com.tencent.angel.ml.core.graphsubmit.GraphRunner"
modelClass="com.tencent.angel.ml.classification.NeuralFactorizationMachines"

$ANGEL_HOME/bin/angel-submit \
    --angel.job.name NFM \
    --action.type train \
    --angel.app.submit.class $runner \
    --ml.model.class.name $modelClass \
    --angel.train.data.path $input_path \
    --angel.workergroup.number $workerNumber \
    --angel.worker.memory.gb $workerMemory \
    --angel.ps.number $PSNumber \
    --angel.ps.memory.gb $PSMemory \
    --angel.task.data.storage.level $storageLevel \
    --angel.task.memorystorage.max.gb $taskMemory
```

For deep learning models, specify the data, training, and network configuration in the Json file whenever possible.