Data Processing - MultiColStringIndexer - 《Alink v1.0.1 Document》


Description

Encodes several string columns into bigint indices. Within each column the indices are consecutive bigint values starting from 0. Non-string columns are first converted to strings and then encoded. Each column is encoded separately, so the same token may receive different indices in different columns.
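The per-column behavior described above can be sketched in plain Python. This is only an illustration, not Alink's implementation: the helper `encode_columns` and the sample rows are hypothetical, and alphabetical ordering is assumed so the result is deterministic.

```python
def encode_columns(rows, selected_cols):
    """Encode each selected column independently into 0-based integer indices.

    Hypothetical helper for illustration; not part of Alink.
    """
    mappings = {}
    for col in selected_cols:
        # Non-string values are converted to strings before indexing.
        tokens = sorted({str(row[col]) for row in rows})
        # One separate token -> index mapping per column.
        mappings[col] = {tok: i for i, tok in enumerate(tokens)}
    encoded = [
        {col: mappings[col][str(row[col])] for col in selected_cols}
        for row in rows
    ]
    return encoded, mappings


rows = [
    {"sport": "football", "level": 2},
    {"sport": "tennis", "level": 10},
    {"sport": "football", "level": 10},
]
encoded, mappings = encode_columns(rows, ["sport", "level"])
print(mappings)
print(encoded)
```

Because the integer column is converted to strings first, "10" sorts before "2" in this sketch, which is worth keeping in mind when indexing numeric columns.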

Several string order types are supported. The order type determines which token receives which index:

random
frequency_asc
frequency_desc
alphabet_asc
alphabet_desc
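As a rough illustration of how these order types could translate into index assignments, here is a minimal plain-Python sketch. The function `build_index` is hypothetical and not part of Alink. Breaking frequency ties alphabetically and seeding the random order are assumptions made here for reproducibility; the source does not say how Alink handles either case.

```python
import random
from collections import Counter


def build_index(values, order_type="random", seed=0):
    """Return a token -> index mapping for one column (illustrative only)."""
    counts = Counter(values)
    tokens = sorted(counts)  # alphabetical baseline
    if order_type == "alphabet_asc":
        ordered = tokens
    elif order_type == "alphabet_desc":
        ordered = tokens[::-1]
    elif order_type == "frequency_asc":
        # Stable sort: equal counts keep alphabetical order (assumption).
        ordered = sorted(tokens, key=lambda t: counts[t])
    elif order_type == "frequency_desc":
        ordered = sorted(tokens, key=lambda t: -counts[t])
    elif order_type == "random":
        ordered = tokens[:]
        random.Random(seed).shuffle(ordered)  # seed is an assumption
    else:
        raise ValueError(f"unknown order type: {order_type}")
    return {tok: i for i, tok in enumerate(ordered)}


sports = ["football"] * 3 + ["basketball"] * 2 + ["tennis"]
for order in ("frequency_asc", "frequency_desc", "alphabet_asc", "alphabet_desc"):
    print(order, build_index(sports, order))
```

With frequency_asc the rarest token gets index 0, and with frequency_desc the most frequent token does.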

Parameters

Name	Description	Type	Required?	Default Value
handleInvalid	Strategy to handle unseen token when doing prediction, one of “keep”, “skip” or “error”	String		“keep”
selectedCols	Names of the columns used for processing	String[]	✓
stringOrderType	String order type, one of “random”, “frequency_asc”, “frequency_desc”, “alphabet_asc”, “alphabet_desc”.	String		“random”
reservedCols	Names of the columns to be retained in the output table	String[]		null
outputCols	Names of the output columns	String[]		null
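The table names the three handleInvalid strategies but does not spell out what each one returns. The sketch below shows one plausible reading in plain Python. The function `lookup` is hypothetical, and the specific outcomes are assumptions rather than confirmed Alink behavior: "keep" maps an unseen token to one past the largest known index, "skip" yields a null value, and "error" raises.

```python
def lookup(mapping, token, handle_invalid="keep"):
    """Look up a token's index, applying an unseen-token strategy.

    Illustrative only; the outcome of each strategy is an assumption.
    """
    if token in mapping:
        return mapping[token]
    if handle_invalid == "keep":
        return len(mapping)  # assumed: next free index for all unseen tokens
    if handle_invalid == "skip":
        return None  # assumed: null output for unseen tokens
    if handle_invalid == "error":
        raise KeyError(f"unseen token: {token!r}")
    raise ValueError(f"unknown handleInvalid value: {handle_invalid}")


mapping = {"tennis": 0, "basketball": 1, "football": 2}
print(lookup(mapping, "football"))       # known token
print(lookup(mapping, "rugby", "keep"))  # unseen token, kept
print(lookup(mapping, "rugby", "skip"))  # unseen token, skipped
```

Check the behavior of the Alink version in use before relying on any of these outcomes.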

Script Example

Code

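import numpy as np
import pandas as pd
from pyalink.alink import *  # provides dataframeToOperator and MultiStringIndexer

# Six sample rows: "football" x3, "basketball" x2, "tennis" x1.
data = np.array([
    ["football"],
    ["football"],
    ["football"],
    ["basketball"],
    ["basketball"],
    ["tennis"],
])
df_data = pd.DataFrame({
    "f0": data[:, 0],
})
# Convert the pandas DataFrame into an Alink batch operator.
data = dataframeToOperator(df_data, schemaStr='f0 string', op_type='batch')
# Index column f0 by ascending frequency, so the rarest token gets index 0.
stringindexer = MultiStringIndexer() \
    .setSelectedCols(["f0"]) \
    .setOutputCols(["f0_indexed"]) \
    .setStringOrderType("frequency_asc")
# Fit the indexer, apply it to the same data, and print the result.
stringindexer.fit(data).transform(data).print()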

Results

Output:

           f0  f0_indexed
0    football           2
1    football           2
2    football           2
3  basketball           1
4  basketball           1
5      tennis           0