Deep And Wide

1. Algorithm Introduction

The Deep and Wide algorithm feeds the embedding results directly into a DNN to extract higher-order feature crosses, and finally combines the first-order (wide) features with the higher-order (deep) features to make the prediction. Its architecture is shown below:
(Architecture diagram: a wide first-order component and a deep component of Embedding + DNN, joined before the loss layer)
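In formula form, a minimal sketch of the prediction (assuming a sigmoid output for binary classification, matching the logloss used in the example configuration below; the actual output depends on the loss configured in SimpleLossLayer):

$$
\hat{y} = \sigma\Big(\underbrace{w^{T}x + b}_{\text{wide: first-order}} \;+\; \underbrace{\mathrm{DNN}\big(\mathrm{Embedding}(x)\big)}_{\text{deep: higher-order}}\Big)
$$

Both terms are one-dimensional, so they can be added element-wise by SumPooling before the loss is applied.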

1.1 Layers in Deep and Wide

  • SparseInputLayer: sparse data input layer, specially optimized for sparse, high-dimensional data; essentially an FCLayer
  • Embedding: implicit embedding layer; if a feature is not one-hot, the looked-up vector is multiplied by the feature value (a minimal sketch follows this list)
  • FCLayer: the most common layer in a DNN, a linear transformation followed by a transfer (activation) function
  • SumPooling: element-wise sum of multiple inputs, which are required to have the same shape
  • SimpleLossLayer: loss layer; different loss functions can be specified
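To make the Embedding layer's value-scaling concrete, here is a minimal sketch in plain Scala; the lookup table, the embed helper, and the sample indices/values are illustrative assumptions rather than Angel's actual API:

```scala
// Minimal sketch of the Embedding layer's lookup-and-scale behaviour (plain Scala,
// not Angel's API). Each active feature index is mapped to a numFactors-dimensional
// vector; if the feature value is not 1 (i.e. the input is not one-hot), the vector
// is multiplied element-wise by that value.
object EmbeddingSketch {
  val numFactors = 8

  // hypothetical embedding table: feature index -> embedding vector
  val table: Map[Long, Array[Double]] = Map(
    3L -> Array.fill(numFactors)(0.1),
    7L -> Array.fill(numFactors)(0.2)
  )

  def embed(indices: Array[Long], values: Array[Double]): Array[Array[Double]] =
    indices.zip(values).map { case (idx, v) =>
      table(idx).map(_ * v) // scale by the feature value; a no-op for one-hot input
    }

  def main(args: Array[String]): Unit = {
    // feature 3 is one-hot (value 1.0), feature 7 carries a real value of 2.5
    val out = embed(Array(3L, 7L), Array(1.0, 2.5))
    out.foreach(v => println(v.mkString(", ")))
  }
}
```

For one-hot input every feature value is 1.0, so the layer degenerates to a plain table lookup.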

1.2 Network Construction

```scala
override def buildNetwork(): Unit = {
  // wide part: first-order features, essentially a sparse FC layer with identity transfer
  val wide = new SparseInputLayer("input", 1, new Identity(),
    JsonUtils.getOptimizerByLayerType(jsonAst, "SparseInputLayer"))

  // deep part: embedding followed by a stack of FC layers read from the Json config
  val embeddingParams = JsonUtils.getLayerParamsByLayerType(jsonAst, "Embedding")
    .asInstanceOf[EmbeddingParams]
  val embedding = new Embedding("embedding", embeddingParams.outputDim, embeddingParams.numFactors,
    embeddingParams.optimizer.build())
  val hiddenLayer = JsonUtils.getFCLayer(jsonAst, embedding)

  // join the wide and deep outputs element-wise and attach the loss layer
  val join = new SumPooling("sumPooling", 1, Array[Layer](wide, hiddenLayer))
  new SimpleLossLayer("simpleLossLayer", join, lossFunc)
}
```
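Note that both branches end in a one-dimensional output: the wide SparseInputLayer and the last FC layer each produce outputdim = 1, so SumPooling can add them element-wise before SimpleLossLayer applies the configured loss (logloss in the example below).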

2. Running and Performance

2.1 Json Configuration File

Deep and Wide has a relatively large number of parameters, so they need to be specified with a Json configuration file (for a complete description of the Json configuration file, please refer to the Json documentation). A typical example is as follows:

```json
{
  "data": {
    "format": "dummy",
    "indexrange": 148,
    "numfield": 13,
    "validateratio": 0.1
  },
  "model": {
    "modeltype": "T_DOUBLE_SPARSE_LONGKEY",
    "modelsize": 148
  },
  "train": {
    "epoch": 10,
    "numupdateperepoch": 10,
    "lr": 0.1,
    "decay": 0.8
  },
  "default_optimizer": {
    "type": "momentum",
    "momentum": 0.9,
    "reg2": 0.01
  },
  "layers": [
    {
      "name": "wide",
      "type": "sparseinputlayer",
      "outputdim": 1,
      "transfunc": "identity"
    },
    {
      "name": "embedding",
      "type": "embedding",
      "numfactors": 8,
      "outputdim": 104
    },
    {
      "name": "fclayer",
      "type": "FCLayer",
      "inputlayer": "embedding",
      "outputdims": [
        100,
        100,
        1
      ],
      "transfuncs": [
        "relu",
        "relu",
        "identity"
      ]
    },
    {
      "name": "sumPooling",
      "type": "SumPooling",
      "outputdim": 1,
      "inputlayers": [
        "wide",
        "fclayer"
      ]
    },
    {
      "name": "simplelosslayer",
      "type": "simplelosslayer",
      "lossfunc": "logloss",
      "inputlayer": "sumPooling"
    }
  ]
}
```
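Two points worth noting about this example: the embedding's outputdim (104) equals numfield × numfactors (13 × 8), and the "fclayer" entry describes a stack of three fully connected layers whose sizes and transfer functions are read pairwise from outputdims and transfuncs. Schematically, with e the 104-dimensional embedding output (bias terms shown only for illustration):

$$
\begin{aligned}
a_1 &= \mathrm{relu}(W_1 e + b_1), & W_1 &\in \mathbb{R}^{100\times 104},\\
a_2 &= \mathrm{relu}(W_2 a_1 + b_2), & W_2 &\in \mathbb{R}^{100\times 100},\\
\mathrm{deep}(x) &= W_3 a_2 + b_3, & W_3 &\in \mathbb{R}^{1\times 100}.
\end{aligned}
$$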

2.2 Submit Script

```shell
runner="com.tencent.angel.ml.core.graphsubmit.GraphRunner"
modelClass="com.tencent.angel.ml.classification.WideAndDeep"

$ANGEL_HOME/bin/angel-submit \
    --angel.job.name DeepAndWide \
    --action.type train \
    --angel.app.submit.class $runner \
    --ml.model.class.name $modelClass \
    --angel.train.data.path $input_path \
    --angel.workergroup.number $workerNumber \
    --angel.worker.memory.gb $workerMemory \
    --angel.ps.number $PSNumber \
    --angel.ps.memory.gb $PSMemory \
    --angel.task.data.storage.level $storageLevel \
    --angel.task.memorystorage.max.gb $taskMemory
```
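The $-prefixed values ($input_path, $workerNumber, $workerMemory, $PSNumber, $PSMemory, $storageLevel, $taskMemory) are shell variables that should be defined before calling angel-submit, sized according to the data volume and the available resources.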

For deep learning models, the data, training, and network configuration should preferably be specified via the Json file.