Description

Gradient Boosting (often abbreviated to GBDT or GBM) is a popular supervised learning model. It is a strong off-the-shelf choice for a wide range of problems, especially those with medium to large data sizes.
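To make the boosting idea concrete, here is a minimal single-machine sketch. It is not this operator's implementation: it assumes squared-error loss, depth-1 trees (stumps) on a single numeric feature, and an initial prediction of 0. The helper names `fit_stump`, `fit_boosting`, and `predict` are hypothetical. Each round fits a stump to the current residuals and adds it to the ensemble.

```python
def fit_stump(xs, residuals):
    """Return (threshold, left_value, right_value) minimising squared error.

    Assumes xs contains at least two distinct values.
    """
    best = None
    values = sorted(set(xs))
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2.0
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1], best[2], best[3]


def fit_boosting(xs, ys, num_trees=3, learning_rate=1.0):
    """Fit num_trees stumps, each on the residuals left by the previous ones."""
    predictions = [0.0] * len(xs)  # assumption: start from 0, not the label mean
    stumps = []
    for _ in range(num_trees):
        residuals = [y - p for y, p in zip(ys, predictions)]
        threshold, left_value, right_value = fit_stump(xs, residuals)
        stumps.append((threshold, left_value, right_value))
        predictions = [
            p + learning_rate * (left_value if x <= threshold else right_value)
            for x, p in zip(xs, predictions)
        ]
    return stumps


def predict(stumps, x, learning_rate=1.0):
    """Sum the scaled outputs of all stumps for one input value."""
    return sum(learning_rate * (left if x <= t else right)
               for t, left, right in stumps)


# Toy data: a single numeric feature and a 0/1 label.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [0, 0, 1, 1]
stumps = fit_boosting(xs, ys, num_trees=3, learning_rate=1.0)
print([predict(stumps, x) for x in xs])  # [0.0, 0.0, 1.0, 1.0]
```

On this toy data the first stump already separates the labels, so the later rounds fit residuals of zero and contribute nothing.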

This implementation uses a histogram-based algorithm. See “McRank: Learning to Rank Using Multiple Classification and Gradient Boosting”, Ping Li et al., NIPS 2007, for details and experiments on histogram-based algorithms.
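The histogram idea can be sketched as follows. This is an illustration under stated assumptions, not the operator's code: it uses equal-width bins (implementations often use quantile-based bins instead), a single feature, and squared-error loss, and the function names are hypothetical. Feature values are bucketed once, residual sums and counts are accumulated per bin, and candidate splits are evaluated only at bin boundaries.

```python
def build_histogram(values, residuals, num_bins, lo, hi):
    """Accumulate the residual sum and sample count of each equal-width bin."""
    width = (hi - lo) / num_bins
    resid_sum = [0.0] * num_bins
    count = [0] * num_bins
    for v, r in zip(values, residuals):
        b = min(int((v - lo) / width), num_bins - 1)  # clamp v == hi into last bin
        resid_sum[b] += r
        count[b] += 1
    return resid_sum, count


def best_split(resid_sum, count):
    """Scan bin boundaries; return (gain, index of the last bin on the left).

    Gain is the reduction in squared error when both sides predict their mean:
    S_L^2 / n_L + S_R^2 / n_R - S^2 / n.
    """
    total_s, total_n = sum(resid_sum), sum(count)
    best_gain, best_bin = 0.0, None
    left_s, left_n = 0.0, 0
    for b in range(len(count) - 1):
        left_s += resid_sum[b]
        left_n += count[b]
        right_s, right_n = total_s - left_s, total_n - left_n
        if left_n == 0 or right_n == 0:
            continue  # a split must leave samples on both sides
        gain = (left_s ** 2 / left_n + right_s ** 2 / right_n
                - total_s ** 2 / total_n)
        if gain > best_gain:
            best_gain, best_bin = gain, b
    return best_gain, best_bin


values = [1.0, 2.0, 3.0, 4.0]
residuals = [0.0, 0.0, 1.0, 1.0]
hist = build_histogram(values, residuals, num_bins=4, lo=1.0, hi=4.0)
gain, split_bin = best_split(*hist)
print(split_bin)  # 1, i.e. bins 0-1 go left and bins 2-3 go right
```

After the histogram is built, the scan costs one pass over the bins rather than one pass over the sorted samples, which is the main appeal of the approach.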

This implementation uses a layer-wise tree-growing strategy rather than a leaf-wise one (such as the one in “LightGBM: A Highly Efficient Gradient Boosting Decision Tree”, Guolin Ke et al., NIPS 2017), because we found the former to be faster in a Flink-based distributed computing environment.
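The difference between the two growing strategies is the order in which nodes are split. The sketch below illustrates only that ordering. The split gains are made-up numbers, nodes use heap-style ids (root 1, children `2*i` and `2*i + 1`), and the function names are hypothetical rather than part of any library.

```python
import heapq

# Hypothetical split gains per node id; a node missing from the dict cannot be split.
GAINS = {1: 10.0, 2: 1.0, 3: 8.0, 4: 0.5, 5: 0.4, 6: 6.0, 7: 5.0}


def layer_wise_order(gains, max_depth):
    """Split every splittable node of a layer before moving to the next layer."""
    order, layer = [], [1]
    for _ in range(max_depth):
        splittable = [node for node in layer if node in gains]
        order.extend(splittable)
        layer = [child for node in splittable
                 for child in (2 * node, 2 * node + 1)]
    return order


def leaf_wise_order(gains, num_splits):
    """Always split the current leaf with the largest gain."""
    order = []
    heap = [(-gains[1], 1)]  # max-heap via negated gains
    while heap and len(order) < num_splits:
        _, node = heapq.heappop(heap)
        order.append(node)
        for child in (2 * node, 2 * node + 1):
            if child in gains:
                heapq.heappush(heap, (-gains[child], child))
    return order


print(layer_wise_order(GAINS, max_depth=2))  # [1, 2, 3]
print(leaf_wise_order(GAINS, num_splits=3))  # [1, 3, 6]
```

With the same budget of three splits, layer-wise growth produces a balanced two-level tree, while leaf-wise growth follows the highest-gain branch and produces a deeper, unbalanced one.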

This implementation uses a data-parallel algorithm. See “A Communication-Efficient Parallel Algorithm for Decision Tree”, Qi Meng et al., NIPS 2016, for an introduction to data-parallel, feature-parallel, and other algorithms for constructing decision forests.
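In a data-parallel scheme the rows are partitioned across workers, whereas a feature-parallel scheme partitions the columns. A minimal single-process sketch of the data-parallel case, assuming histogram statistics are what the workers exchange: each simulated worker builds a histogram over its own rows, and the histograms are summed element-wise, standing in for an all-reduce. The function names and the toy rows are hypothetical, and this does not reproduce how the operator communicates on Flink.

```python
def local_histogram(bins, residuals, num_bins):
    """Per-worker statistics: residual sum and row count of each bin."""
    resid_sum = [0.0] * num_bins
    count = [0] * num_bins
    for b, r in zip(bins, residuals):
        resid_sum[b] += r
        count[b] += 1
    return resid_sum, count


def merge_histograms(histograms):
    """Element-wise sum of the workers' histograms (stand-in for an all-reduce)."""
    num_bins = len(histograms[0][0])
    resid_sum = [sum(h[0][b] for h in histograms) for b in range(num_bins)]
    count = [sum(h[1][b] for h in histograms) for b in range(num_bins)]
    return resid_sum, count


# Each row: (bin index of one feature, residual). Values are made up.
rows = [(0, 0.0), (1, 0.0), (2, 1.0), (3, 1.0), (0, 0.5), (3, 1.5)]

# Two simulated workers, each holding a disjoint subset of the rows.
shards = [rows[0::2], rows[1::2]]
local = [
    local_histogram([b for b, _ in shard], [r for _, r in shard], num_bins=4)
    for shard in shards
]
merged = merge_histograms(local)

# The merged histogram matches the one built from all rows on a single machine.
full = local_histogram([b for b, _ in rows], [r for _, r in rows], num_bins=4)
print(merged == full)  # True
```

Because the merged statistics equal the single-machine ones, every worker can pick the same split from them without ever seeing another worker's rows.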

Parameters

| Name | Description | Type | Required? | Default Value |
| --- | --- | --- | --- | --- |
| predictionCol | Name of the prediction column. | String | | |
| predictionDetailCol | Name of the column that holds the prediction result with detailed information. | String | | |
| reservedCols | Names of the columns to be retained in the output table. | String[] | | null |

Script Example

Script

import numpy as np
import pandas as pd
from pyalink.alink import *


def exampleData():
    # Mixed-type rows: NumPy stores every value as a string here; the
    # schemaStr passed to dataframeToOperator declares the column types.
    return np.array([
        [1.0, "A", 0, 0, 0],
        [2.0, "B", 1, 1, 0],
        [3.0, "C", 2, 2, 1],
        [4.0, "D", 3, 3, 1]
    ])


def sourceFrame():
    data = exampleData()
    return pd.DataFrame({
        "f0": data[:, 0],
        "f1": data[:, 1],
        "f2": data[:, 2],
        "f3": data[:, 3],
        "label": data[:, 4]
    })


def batchSource():
    return dataframeToOperator(
        sourceFrame(),
        schemaStr='''
        f0 double,
        f1 string,
        f2 int,
        f3 int,
        label int
        ''',
        op_type='batch'
    )


def streamSource():
    return dataframeToOperator(
        sourceFrame(),
        schemaStr='''
        f0 double,
        f1 string,
        f2 int,
        f3 int,
        label int
        ''',
        op_type='stream'
    )


# Train a GBDT regression model on the batch source.
trainOp = (
    GbdtRegTrainBatchOp()
    .setLearningRate(1.0)
    .setNumTrees(3)
    .setMinSamplesPerLeaf(1)
    .setLabelCol('label')
    .setFeatureCols(['f0', 'f1', 'f2', 'f3'])
)

# Batch prediction: the first input is the trained model, the second is the data.
predictBatchOp = (
    GbdtRegPredictBatchOp()
    .setPredictionCol('pred')
)

(
    predictBatchOp
    .linkFrom(
        batchSource().link(trainOp),
        batchSource()
    )
    .print()
)

# Stream prediction: the trained model is passed to the constructor.
predictStreamOp = (
    GbdtRegPredictStreamOp(
        batchSource().link(trainOp)
    )
    .setPredictionCol('pred')
)

(
    predictStreamOp
    .linkFrom(
        streamSource()
    )
    .print()
)

# Start the streaming job.
StreamOperator.execute()

Result

Batch prediction

    f0 f1  f2  f3  label  pred
0  1.0  A   0   0      0   0.0
1  2.0  B   1   1      0   0.0
2  3.0  C   2   2      1   1.0
3  4.0  D   3   3      1   1.0