Description

Encodes one column of strings as bigint indices. The indices are consecutive bigint values starting from 0. Non-string columns are first converted to strings and then encoded.

Several string order types are supported:

  1. random
  2. frequency_asc
  3. frequency_desc
  4. alphabet_asc
  5. alphabet_desc
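
To illustrate how the order types could affect the assigned indices, here is a minimal plain-Python sketch. It is not the library's implementation: `build_index` is a hypothetical helper, ties in frequency are broken alphabetically as an assumption, and "random" is modeled as first-seen order only because no particular order is guaranteed for it.

```python
from collections import Counter


def build_index(values, order_type="random"):
    """Map each distinct token to a consecutive integer index starting at 0."""
    # Non-string values are converted to strings before encoding.
    tokens = [str(v) for v in values]
    counts = Counter(tokens)
    distinct = list(counts)  # distinct tokens in first-seen order

    if order_type == "frequency_asc":
        # Rarest token first; ties broken alphabetically (assumption).
        distinct.sort(key=lambda t: (counts[t], t))
    elif order_type == "frequency_desc":
        # Most frequent token first; ties broken alphabetically (assumption).
        distinct.sort(key=lambda t: (-counts[t], t))
    elif order_type == "alphabet_asc":
        distinct.sort()
    elif order_type == "alphabet_desc":
        distinct.sort(reverse=True)
    elif order_type != "random":
        raise ValueError("unknown order type: " + order_type)
    # "random": no order is promised, so first-seen order is kept here.

    return {token: i for i, token in enumerate(distinct)}


sports = ["football"] * 3 + ["basketball"] * 2 + ["tennis"]
freq_asc = build_index(sports, "frequency_asc")
alpha_asc = build_index(sports, "alphabet_asc")
print(freq_asc)   # {'tennis': 0, 'basketball': 1, 'football': 2}
print(alpha_asc)  # {'basketball': 0, 'football': 1, 'tennis': 2}
```

With "frequency_asc" the rarest string gets index 0, while "alphabet_asc" depends only on the string values and ignores their counts.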

Parameters

| Name | Description | Type | Required? | Default Value |
| --- | --- | --- | --- | --- |
| modelName | Name of the model | String | | |
| handleInvalid | Strategy for handling unseen tokens during prediction, one of “keep”, “skip” or “error” | String | | “keep” |
| selectedCol | Name of the selected column used for processing | String | | |
| stringOrderType | String order type, one of “random”, “frequency_asc”, “frequency_desc”, “alphabet_asc”, “alphabet_desc” | String | | “random” |
| reservedCols | Names of the columns to be retained in the output table | String[] | | null |
| outputCol | Name of the output column | String | | null |
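
The handleInvalid parameter only names its three strategies, so the following plain-Python sketch shows one plausible reading of them. The behavior is an assumption, not confirmed by this page: "keep" maps every unseen token to one extra index after the known ones, "skip" emits no index (`None`), and "error" raises. `encode` is a hypothetical helper, not part of the library.

```python
def encode(tokens, index, handle_invalid="keep"):
    """Look up each token in a fitted index, applying a strategy to unseen tokens."""
    if handle_invalid not in ("keep", "skip", "error"):
        raise ValueError("unknown strategy: " + handle_invalid)

    encoded = []
    for token in tokens:
        if token in index:
            encoded.append(index[token])
        elif handle_invalid == "keep":
            # Assumption: all unseen tokens share the index len(index).
            encoded.append(len(index))
        elif handle_invalid == "skip":
            # Assumption: an unseen token produces a missing value.
            encoded.append(None)
        else:
            raise ValueError("unseen token: " + token)
    return encoded


# Index as it would look after fitting with "frequency_asc" on the sample data.
fitted = {"tennis": 0, "basketball": 1, "football": 2}
kept = encode(["football", "golf"], fitted, "keep")
skipped = encode(["football", "golf"], fitted, "skip")
print(kept)     # [2, 3]
print(skipped)  # [2, None]
```

Whether "skip" yields a missing value or drops the row differs between libraries, so check the behavior of your version before relying on it.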

Script Example

Code

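import numpy as np
import pandas as pd
from pyalink.alink import *

# Start a local PyAlink environment with parallelism 1.
useLocalEnv(1)

# Six rows: "football" x3, "basketball" x2, "tennis" x1.
data = np.array([
    ["football"],
    ["football"],
    ["football"],
    ["basketball"],
    ["basketball"],
    ["tennis"],
])
df_data = pd.DataFrame({
    "f0": data[:, 0],
})
# Wrap the DataFrame as a batch operator with a single string column.
data = dataframeToOperator(df_data, schemaStr='f0 string', op_type="batch")

# frequency_asc: the least frequent string receives index 0.
stringindexer = StringIndexer() \
    .setSelectedCol("f0") \
    .setOutputCol("f0_indexed") \
    .setStringOrderType("frequency_asc")
stringindexer.fit(data).transform(data).print()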

Results

           f0  f0_indexed
0    football           2
1    football           2
2    football           2
3  basketball           1
4  basketball           1
5      tennis           0