Description

Imputer fills in missing values in a dataset. All columns selected in a single run must be of the same type. Imputer Train trains a model that is then used for prediction. The supported strategies are min, max, mean, and value. With min, max, or mean, each missing value is replaced by the minimum, maximum, or mean of its column, respectively. With value, each missing value is replaced by the given fill value.
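The four strategies can be illustrated outside Alink with a minimal pandas sketch. This is not the Alink implementation; the helper name `impute` and its arguments are hypothetical and exist only to show what each strategy computes per column.

```python
import pandas as pd


def impute(df, cols, strategy="mean", fill_value=None):
    """Hypothetical helper: fill missing values in `cols` with one strategy."""
    out = df.copy()
    for col in cols:
        if strategy == "mean":
            replacement = out[col].mean()   # NaN entries are skipped
        elif strategy == "min":
            replacement = out[col].min()
        elif strategy == "max":
            replacement = out[col].max()
        elif strategy == "value":
            if fill_value is None:
                raise ValueError("strategy 'value' needs a fill_value")
            replacement = fill_value
        else:
            raise ValueError(f"unsupported strategy: {strategy}")
        out[col] = out[col].fillna(replacement)
    return out


# Small illustrative frame (not the data from the script example).
df = pd.DataFrame({"col2": [10.0, -2.5, None], "col3": [100.0, 9.0, None]})

mean_filled = impute(df, ["col2", "col3"], "mean")
min_filled = impute(df, ["col2", "col3"], "min")
max_filled = impute(df, ["col2", "col3"], "max")
value_filled = impute(df, ["col2", "col3"], "value", fill_value=0.0)
```

The replacement is computed from the non-missing entries of each column, so the columns are handled independently of one another.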

Parameters

| Name | Description | Type | Required? | Default Value |
| --- | --- | --- | --- | --- |
| strategy | Strategy used to fill missing values: mean, max, min, or value | String | | “mean” |
| fillValue | Value used to fill all missing values | String | | null |
| selectedCols | Names of the columns to process | String[] | | |

Script Example

    import numpy as np
    import pandas as pd
    from pyalink.alink import *

    data = np.array([
        ["a", 10.0, 100],
        ["b", -2.5, 9],
        ["c", 100.2, 1],
        ["d", -99.9, 100],
        ["a", 1.4, 1],
        ["b", -2.2, 9],
        ["c", 100.9, 1],
        [None, None, None]
    ])
    colnames = ["col1", "col2", "col3"]
    selectedColNames = ["col2", "col3"]
    df = pd.DataFrame({"col1": data[:, 0], "col2": data[:, 1], "col3": data[:, 2]})
    inOp = dataframeToOperator(df, schemaStr='col1 string, col2 double, col3 long', op_type='batch')

    # train: the default strategy ("mean") is used because none is set
    trainOp = ImputerTrainBatchOp()\
        .setSelectedCols(selectedColNames)
    trainOp.linkFrom(inOp)

    # batch predict: apply the trained model to the batch data
    predictOp = ImputerPredictBatchOp()
    predictOp.linkFrom(trainOp, inOp).print()

    # stream predict: apply the same trained model to a stream source
    sinOp = dataframeToOperator(df, schemaStr='col1 string, col2 double, col3 long', op_type='stream')
    predictStreamOp = ImputerPredictStreamOp(trainOp)
    predictStreamOp.linkFrom(sinOp).print()
    StreamOperator.execute()

Results

       col1        col2  col3
    0     a   10.000000   100
    1     b   -2.500000     9
    2     c  100.200000     1
    3     d  -99.900000   100
    4     a    1.400000     1
    5     b   -2.200000     9
    6     c  100.900000     1
    7  None   15.414286    31
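The filled values in the last row can be checked by hand against the default mean strategy. The sketch below recomputes them in plain Python from the seven non-missing rows. Truncating the col3 mean to an integer is an assumption inferred from the output above (col3 is declared as long), not documented behavior.

```python
# Non-missing values of the two selected columns, taken from the script example.
col2 = [10.0, -2.5, 100.2, -99.9, 1.4, -2.2, 100.9]
col3 = [100, 9, 1, 100, 1, 9, 1]

# col2 is a double column: the fill value is the arithmetic mean.
col2_fill = sum(col2) / len(col2)

# col3 is a long column: 221 / 7 is about 31.57, while the output shows 31.
# Assumption: the mean is truncated to fit the integer column type.
col3_fill = sum(col3) // len(col3)

print(round(col2_fill, 6))
print(col3_fill)
```

col1 is not among the selected columns, so its missing value stays None.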