Description

Encodes one column of strings as bigint indices. The indices are consecutive bigint values starting from 0. Non-string columns are first converted to strings and then encoded.

Several string order types are supported:

  1. random
  2. frequency_asc
  3. frequency_desc
  4. alphabet_asc
  5. alphabet_desc
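
As a rough illustration of what the five order types mean, the following plain-Python sketch assigns consecutive indices starting from 0 under each ordering. It is not Alink's implementation: the function name `build_string_index`, the `seed` argument, and the alphabetical tie-breaking between tokens of equal frequency are assumptions made only for this example.

```python
from collections import Counter
import random


def build_string_index(values, order_type="random", seed=None):
    """Hypothetical helper: map each distinct token to a consecutive index from 0."""
    # Non-string values are converted to strings before encoding.
    counts = Counter(str(v) for v in values)
    tokens = list(counts)

    if order_type == "random":
        # Any permutation is acceptable; `seed` is only here to make runs repeatable.
        random.Random(seed).shuffle(tokens)
    elif order_type == "frequency_asc":
        # Rarest token first; ties broken alphabetically (an assumption).
        tokens.sort(key=lambda t: (counts[t], t))
    elif order_type == "frequency_desc":
        # Most frequent token first; ties broken alphabetically (an assumption).
        tokens.sort(key=lambda t: (-counts[t], t))
    elif order_type == "alphabet_asc":
        tokens.sort()
    elif order_type == "alphabet_desc":
        tokens.sort(reverse=True)
    else:
        raise ValueError(f"unknown order type: {order_type!r}")

    return {token: index for index, token in enumerate(tokens)}


sports = ["football"] * 3 + ["basketball"] * 2 + ["tennis"]
print(build_string_index(sports, "frequency_asc"))
# {'tennis': 0, 'basketball': 1, 'football': 2}
```

With "frequency_asc" the least frequent token receives index 0, while "alphabet_asc" ignores counts and sorts the tokens themselves.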

Parameters

| Name | Description | Type | Required? | Default Value |
| --- | --- | --- | --- | --- |
| modelName | Name of the model | String | | |
| selectedCol | Name of the selected column used for processing | String | | |
| stringOrderType | String order type, one of "random", "frequency_asc", "frequency_desc", "alphabet_asc", "alphabet_desc" | String | | "random" |

Script Example

Code

import numpy as np
import pandas as pd
from pyalink.alink import *

# Six rows: "football" x3, "basketball" x2, "tennis" x1.
data = np.array([
    ["football"],
    ["football"],
    ["football"],
    ["basketball"],
    ["basketball"],
    ["tennis"],
])
df_data = pd.DataFrame({
    "f0": data[:, 0],
})
# Convert the pandas DataFrame into an Alink batch operator.
data = dataframeToOperator(df_data, schemaStr='f0 string', op_type="batch")
# Train the indexer on column f0, ordering tokens by ascending frequency.
stringindexer = StringIndexerTrainBatchOp() \
    .setSelectedCol("f0") \
    .setStringOrderType("frequency_asc")
model = stringindexer.linkFrom(data)
model.print()

Results

Model:

        token  token_index
0      tennis            0
1  basketball            1
2    football            2
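
The model table is in effect a token-to-index lookup. The sketch below shows, in plain Python rather than with an Alink operator, how such a lookup could encode the example column. The dictionary is copied by hand from the table above, and the names `token_index` and `column` are chosen only for this illustration.

```python
# Lookup copied from the model table above (token -> token_index).
token_index = {"tennis": 0, "basketball": 1, "football": 2}

# The same six values used in the script example.
column = ["football", "football", "football", "basketball", "basketball", "tennis"]

# Replace each token with its index. A token missing from the lookup would
# raise KeyError here; handling of unseen tokens is not covered by this sketch.
encoded = [token_index[token] for token in column]
print(encoded)
# [2, 2, 2, 1, 1, 0]
```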