Description

Maps strings to indices using the model generated by MultiStringIndexerTrainBatchOp.

Parameters

| Name | Description | Type | Required? | Default Value |
| --- | --- | --- | --- | --- |
| handleInvalid | Strategy for handling unseen tokens during prediction; one of “keep”, “skip” or “error” | String | | “keep” |
| selectedCols | Names of the columns used for processing | String[] | ✓ | |
| reservedCols | Names of the columns to be retained in the output table | String[] | | null |
| outputCols | Names of the output columns | String[] | | null |
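
The three handleInvalid strategies can be illustrated with a small pure-Python sketch. It does not call Alink. The helper name `index_token`, the sample mapping, and the exact behaviour shown (one extra index equal to the vocabulary size for "keep", a missing value for "skip") are assumptions made for illustration and should be checked against the Alink version in use.

```python
def index_token(token, mapping, handle_invalid="keep"):
    """Look up a token's index. Hypothetical helper, not part of Alink."""
    if token in mapping:
        return mapping[token]
    if handle_invalid == "keep":
        # Assumption: all unseen tokens share one extra index after the known ones.
        return len(mapping)
    if handle_invalid == "skip":
        # Assumption: unseen tokens yield a missing value.
        return None
    if handle_invalid == "error":
        raise ValueError(f"unseen token: {token!r}")
    raise ValueError(f"unknown handleInvalid strategy: {handle_invalid!r}")


# Sample mapping, chosen to match the script example on this page.
mapping = {"tennis": 0, "basketball": 1, "football": 2}

print(index_token("football", mapping))      # known token: 2
print(index_token("golf", mapping, "keep"))  # unseen token, kept: 3
print(index_token("golf", mapping, "skip"))  # unseen token, skipped: None
```

With "error", the same call raises an exception instead of returning a value, which may be preferable when an unseen token indicates a data problem.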

Script Example

Code

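    from pyalink.alink import *
    import numpy as np
    import pandas as pd

    data = np.array([
        ["football"],
        ["football"],
        ["football"],
        ["basketball"],
        ["basketball"],
        ["tennis"],
    ])
    df_data = pd.DataFrame({
        "f0": data[:, 0],
    })

    # The model is trained on a batch source; prediction runs on a stream source.
    batch_data = dataframeToOperator(df_data, schemaStr='f0 string', op_type='batch')
    stream_data = dataframeToOperator(df_data, schemaStr='f0 string', op_type='stream')

    # With "frequency_asc", rarer tokens receive smaller indices.
    stringindexer = MultiStringIndexerTrainBatchOp() \
        .setSelectedCols(["f0"]) \
        .setStringOrderType("frequency_asc")
    model = stringindexer.linkFrom(batch_data)

    # The stream predictor takes the trained model in its constructor.
    predictor = MultiStringIndexerPredictStreamOp(model) \
        .setSelectedCols(["f0"]) \
        .setOutputCols(["f0_indexed"])
    predictor.linkFrom(stream_data).print()
    StreamOperator.execute()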

Results

               f0  f0_indexed
    0    football           2
    1    football           2
    2    football           2
    3  basketball           1
    4  basketball           1
    5      tennis           0
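
The indices in the results follow from the "frequency_asc" order used in training: the least frequent token gets index 0. A minimal pure-Python sketch of that ordering, independent of Alink, might look like the following. The helper name `frequency_asc_index` is hypothetical, and the alphabetical tie-break is an assumption added only to keep the sketch deterministic; the example data contains no ties.

```python
from collections import Counter


def frequency_asc_index(tokens):
    """Assign indices by ascending frequency. Hypothetical helper, not Alink code."""
    counts = Counter(tokens)
    # Sort by count first; the token itself breaks ties (an assumption).
    ordered = sorted(counts, key=lambda t: (counts[t], t))
    return {token: index for index, token in enumerate(ordered)}


tokens = ["football"] * 3 + ["basketball"] * 2 + ["tennis"]
mapping = frequency_asc_index(tokens)

print(mapping)                       # {'tennis': 0, 'basketball': 1, 'football': 2}
print([mapping[t] for t in tokens])  # [2, 2, 2, 1, 1, 0]
```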