Data Processing - IndexToString - 《Alink v1.0.1 Document》

Description

Maps columns of indices to strings, based on the model fitted by StringIndexer.

While StringIndexerModel maps strings to indices, IndexToString maps indices back to strings. However, IndexToString has no corresponding estimator (com.alibaba.alink.pipeline.EstimatorBase). Instead, IndexToString uses the model data of a StringIndexerModel to perform predictions.

IndexToString uses the name of the StringIndexerModel to get the model data. The referenced StringIndexerModel must be created before the transform method is called.

A common use case is as follows:

StringIndexer stringIndexer = new StringIndexer()
    .setModelName("name_a") // The fitted StringIndexerModel will have the name "name_a".
    .setSelectedCol(…);

StringIndexerModel model = stringIndexer.fit(…); // This model will have the name "name_a".

IndexToString indexToString = new IndexToString()
    .setModelName("name_a") // Should match the name of one StringIndexerModel.
    .setSelectedCol(…)
    .setOutputCol(…);

indexToString.transform(…); // Relies on the StringIndexerModel named "name_a" to do the transformation.
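
The requirement that the named model exists before transform is called can be pictured with a small plain-Python sketch. It does not use Alink: `MODEL_REGISTRY`, `fit_string_indexer`, and `index_to_string` are hypothetical names chosen for illustration, and assigning indices in order of first appearance is an assumption, not Alink's actual ordering.

```python
# Plain-Python illustration of lookup by model name; not Alink code.
MODEL_REGISTRY = {}  # hypothetical store: model name -> {string: index}


def fit_string_indexer(model_name, values):
    """Assign an index to each distinct string and register the mapping by name."""
    mapping = {}
    for value in values:
        if value not in mapping:
            mapping[value] = len(mapping)  # assumption: order of first appearance
    MODEL_REGISTRY[model_name] = mapping
    return mapping


def index_to_string(model_name, indices):
    """Resolve the mapping by name at call time and invert it."""
    if model_name not in MODEL_REGISTRY:
        raise KeyError(f"no fitted model registered under {model_name!r}")
    inverse = {index: value for value, index in MODEL_REGISTRY[model_name].items()}
    return [inverse[i] for i in indices]


# The model must be fitted (and thereby registered) before the reverse mapping runs.
fit_string_indexer("name_a", ["football", "tennis", "football"])
restored = index_to_string("name_a", [0, 1, 0])
print(restored)
```

Calling `index_to_string` with a name that was never registered raises `KeyError` in this sketch, which mirrors the ordering requirement stated above.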

The model name registration mechanism makes it possible to stack both StringIndexer and IndexToString into a Pipeline. For example:

StringIndexer stringIndexer = new StringIndexer()
    .setModelName("si_model_0")
    .setSelectedCol("label");

MultilayerPerceptronClassifier mlpc = new MultilayerPerceptronClassifier()
    .setVectorCol("features")
    .setLabelCol("label")
    .setPredictionCol("predicted_label");

IndexToString indexToString = new IndexToString()
    .setModelName("si_model_0")
    .setSelectedCol("predicted_label");

Pipeline pipeline = new Pipeline().add(stringIndexer).add(mlpc).add(indexToString);

pipeline.fit(…);
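
A rough plain-Python sketch of why name-based lookup lets a stage without its own fit step sit in the same pipeline as the stage that produces its model. The classes `ToyStringIndexer` and `ToyIndexToString`, the helper `fit_and_transform`, and the alphabetical index order are all hypothetical stand-ins, not Alink APIs; the classifier stage is omitted to keep the sketch short.

```python
# Plain-Python illustration; the classes below are hypothetical stand-ins, not Alink APIs.
FITTED_MODELS = {}  # model name -> {string: index}


class ToyStringIndexer:
    """Estimator: fit() learns a string->index mapping and registers it by name."""

    def __init__(self, model_name, col):
        self.model_name, self.col = model_name, col

    def fit(self, rows):
        distinct = sorted({row[self.col] for row in rows})  # assumption: alphabetical order
        FITTED_MODELS[self.model_name] = {s: i for i, s in enumerate(distinct)}
        return self

    def transform(self, rows):
        mapping = FITTED_MODELS[self.model_name]
        return [{**row, self.col: mapping[row[self.col]]} for row in rows]


class ToyIndexToString:
    """Transformer only: no fit(); resolves the mapping by name at transform time."""

    def __init__(self, model_name, col):
        self.model_name, self.col = model_name, col

    def transform(self, rows):
        inverse = {i: s for s, i in FITTED_MODELS[self.model_name].items()}
        return [{**row, self.col: inverse[row[self.col]]} for row in rows]


def fit_and_transform(stages, rows):
    """Run stages in order, fitting those that define fit() before transforming."""
    for stage in stages:
        if hasattr(stage, "fit"):
            stage.fit(rows)
        rows = stage.transform(rows)
    return rows


rows = [{"label": "b"}, {"label": "a"}, {"label": "b"}]
stages = [ToyStringIndexer("si_model_0", "label"), ToyIndexToString("si_model_0", "label")]
result = fit_and_transform(stages, rows)
print(result)
```

Because the second stage only holds a model name, it can be constructed before any model exists; the mapping is resolved when the pipeline reaches that stage, by which point the earlier stage has been fitted.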

Parameters

Name	Description	Type	Required?	Default Value
modelName	Name of the model	String	✓
selectedCol	Name of the selected column used for processing	String	✓
reservedCols	Names of the columns to be retained in the output table	String[]		null
outputCol	Name of the output column	String		null

Script Example

Code

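# Assumes a PyAlink environment; the wildcard import provides the Alink operators and dataframeToOperator.
import numpy as np
import pandas as pd
from pyalink.alink import *

data = np.array([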
    ["football"],
    ["football"],
    ["football"],
    ["basketball"],
    ["basketball"],
    ["tennis"],
])
df_data = pd.DataFrame({
    "f0": data[:, 0],
})
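# Convert the pandas DataFrame into an Alink batch operator.
data = dataframeToOperator(df_data, schemaStr='f0 string', op_type="batch")
# Train the string-to-index model; with "frequency_asc", rarer strings get smaller indices.
stringIndexer = StringIndexerTrainBatchOp() \
    .setModelName("string_indexer_model") \
    .setSelectedCol("f0") \
    .setStringOrderType("frequency_asc")
model = stringIndexer.linkFrom(data)
# Map strings to indices using the trained model.
string2int = StringIndexerPredictBatchOp() \
    .setSelectedCol("f0").setOutputCol("f0_indexed")
indexed = string2int.linkFrom(model, data)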
# Map the indices back to strings using the same model.
predictor = IndexToStringPredictBatchOp() \
    .setSelectedCol("f0_indexed") \
    .setOutputCol("f0_indexed_unindexed")
predictor.linkFrom(model, indexed).print()

Results

f0|f0_indexed|f0_indexed_unindexed
--|----------|--------------------
football|2|football
football|2|football
football|2|football
basketball|1|basketball
basketball|1|basketball
tennis|0|tennis
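
The index values in the table follow from the "frequency_asc" ordering: the rarest string receives index 0. The following plain-Python sketch reproduces the same round trip without Alink. It is only an illustration; the alphabetical tie-break is an assumption (the sample data contains no ties), and the variable names are hypothetical.

```python
from collections import Counter

values = ["football", "football", "football", "basketball", "basketball", "tennis"]

# Rank distinct strings by ascending frequency; the rarest string gets index 0.
# Assumption: ties are broken alphabetically (the sample data has no ties).
counts = Counter(values)
ordered = sorted(counts, key=lambda s: (counts[s], s))
string_to_index = {s: i for i, s in enumerate(ordered)}
index_to_string = {i: s for s, i in string_to_index.items()}

indexed = [string_to_index[v] for v in values]    # forward mapping (StringIndexer role)
restored = [index_to_string[i] for i in indexed]  # reverse mapping (IndexToString role)

for original, idx, back in zip(values, indexed, restored):
    print(f"{original}|{idx}|{back}")
```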