Description
Maps strings to indices using the model generated by MultiStringIndexerTrainBatchOp.
Parameters
Name | Description | Type | Required? | Default Value |
---|---|---|---|---|
handleInvalid | Strategy for handling unseen tokens during prediction: one of "keep", "skip", or "error" | String | | "keep" |
selectedCols | Names of the columns to process | String[] | ✓ | |
reservedCols | Names of the columns to retain in the output table | String[] | | null |
outputCols | Names of the output columns | String[] | | null |
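
The three handleInvalid strategies can be illustrated with a plain-Python sketch that does not use Alink. The helper `apply_index` and the `mapping` dictionary are hypothetical names introduced for illustration. The behaviors shown (an unseen token receives the next free index under "keep", `None` under "skip", and an exception under "error") are assumptions about the semantics, not a statement of how the library implements them.

```python
def apply_index(tokens, mapping, handle_invalid="keep"):
    """Map each token to its index (hypothetical helper, not part of Alink)."""
    if handle_invalid not in ("keep", "skip", "error"):
        raise ValueError(f"unknown strategy: {handle_invalid}")
    # Assumption: "keep" assigns unseen tokens the largest known index plus one.
    unseen_index = max(mapping.values(), default=-1) + 1
    result = []
    for token in tokens:
        if token in mapping:
            result.append(mapping[token])
        elif handle_invalid == "keep":
            result.append(unseen_index)
        elif handle_invalid == "skip":
            # Assumption: "skip" leaves the output empty for unseen tokens.
            result.append(None)
        else:
            raise KeyError(f"unseen token: {token!r}")
    return result


# A mapping like the one a trained model might hold for column f0.
mapping = {"tennis": 0, "basketball": 1, "football": 2}
tokens = ["football", "golf"]  # "golf" was not seen during training

kept = apply_index(tokens, mapping, "keep")
skipped = apply_index(tokens, mapping, "skip")
print(kept)     # [2, 3]
print(skipped)  # [2, None]
```

With "error", the same call would raise on "golf" instead of returning a list, which can be preferable when unseen tokens indicate a data-quality problem.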
Script Example
Code
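import numpy as np
import pandas as pd
from pyalink.alink import *

useLocalEnv(1)  # start a local Alink environment with parallelism 1

# Training data: one string column with three distinct tokens
data = np.array([
    ["football"],
    ["football"],
    ["football"],
    ["basketball"],
    ["basketball"],
    ["tennis"],
])
df_data = pd.DataFrame({
    "f0": data[:, 0],
})
data = dataframeToOperator(df_data, schemaStr='f0 string', op_type="batch")

# Train the indexer; "frequency_asc" gives the least frequent token index 0
stringindexer = MultiStringIndexerTrainBatchOp() \
    .setSelectedCols(["f0"]) \
    .setStringOrderType("frequency_asc")
model = stringindexer.linkFrom(data)

# Apply the model, writing the indices to a new column "f0_indexed"
predictor = MultiStringIndexerPredictBatchOp() \
    .setSelectedCols(["f0"]) \
    .setOutputCols(["f0_indexed"])
predictor.linkFrom(model, data).print()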
Results
f0 f0_indexed
0 football 2
1 football 2
2 football 2
3 basketball 1
4 basketball 1
5 tennis 0
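
The indices above follow from the "frequency_asc" order: the least frequent token receives index 0 and the most frequent receives the highest index. A minimal plain-Python sketch of that ordering, independent of Alink, is given below. The helper `build_index` is a hypothetical name, and breaking ties alphabetically is an assumption (the example data contains no ties).

```python
from collections import Counter


def build_index(tokens):
    """Assign indices by ascending frequency (hypothetical helper, not Alink)."""
    counts = Counter(tokens)
    # Sort by count first; ties are broken alphabetically (an assumption).
    ordered = sorted(counts, key=lambda t: (counts[t], t))
    return {token: i for i, token in enumerate(ordered)}


tokens = ["football"] * 3 + ["basketball"] * 2 + ["tennis"]
mapping = build_index(tokens)
indexed = [mapping[t] for t in tokens]

print(mapping)  # {'tennis': 0, 'basketball': 1, 'football': 2}
print(indexed)  # [2, 2, 2, 1, 1, 0]
```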