Description

Encodes several string columns as bigint indices. Within each column, the indices are consecutive integers starting from 0. Non-string columns are first converted to strings and then encoded. Each column is encoded separately.

The following string order types are supported:

  1. random
  2. frequency_asc
  3. frequency_desc
  4. alphabet_asc
  5. alphabet_desc
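
To make the order types concrete, here is a minimal plain-Python sketch of how each one could assign indices. It is an illustration only, not Alink's implementation. The helper names `build_index` and `build_indices` are hypothetical, and two behaviours are assumptions: "random" is modelled as a seeded shuffle, and frequency ties are broken alphabetically.

```python
import random
from collections import Counter


def build_index(values, order_type="random", seed=0):
    """Map each distinct token to a consecutive integer index starting at 0."""
    # Non-string values are converted to strings before encoding.
    tokens = [str(v) for v in values]
    counts = Counter(tokens)
    # Alphabetical baseline; the stable sorts below keep this order for ties
    # (tie-breaking is an assumption of this sketch).
    distinct = sorted(counts)

    if order_type == "random":
        # Assumption: "random" is modelled here as a seeded shuffle.
        random.Random(seed).shuffle(distinct)
    elif order_type == "frequency_asc":
        distinct.sort(key=lambda t: counts[t])
    elif order_type == "frequency_desc":
        distinct.sort(key=lambda t: -counts[t])
    elif order_type == "alphabet_asc":
        pass  # already sorted alphabetically
    elif order_type == "alphabet_desc":
        distinct.reverse()
    else:
        raise ValueError("unknown order type: " + order_type)

    return {token: index for index, token in enumerate(distinct)}


def build_indices(columns, order_type="random"):
    """Encode every column separately, each with its own index space."""
    return {name: build_index(vals, order_type) for name, vals in columns.items()}


sports = ["football"] * 3 + ["basketball"] * 2 + ["tennis"]
freq_asc = build_index(sports, "frequency_asc")
print(freq_asc)
```

With "frequency_asc", the least frequent token receives index 0, so "tennis" maps to 0 and "football" to 2. Because `build_indices` calls `build_index` once per column, the same string can receive different indices in different columns.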

Parameters

| Name | Description | Type | Required? | Default Value |
| --- | --- | --- | --- | --- |
| selectedCols | Names of the columns used for processing | String[] | | |
| stringOrderType | String order type, one of "random", "frequency_asc", "frequency_desc", "alphabet_asc", "alphabet_desc" | String | | "random" |

Script Example

Code

import numpy as np
import pandas as pd
from pyalink.alink import *  # assumes PyAlink is installed

# Toy data: "football" x3, "basketball" x2, "tennis" x1
data = np.array([
    ["football"],
    ["football"],
    ["football"],
    ["basketball"],
    ["basketball"],
    ["tennis"],
])
df_data = pd.DataFrame({
    "f0": data[:, 0],
})
# Convert the DataFrame into an Alink batch operator
data = dataframeToOperator(df_data, schemaStr='f0 string', op_type="batch")
# Train the indexer on column f0, least frequent token first
stringindexer = MultiStringIndexerTrainBatchOp() \
    .setSelectedCols(["f0"]) \
    .setStringOrderType("frequency_asc")
model = stringindexer.linkFrom(data)
model.print()

Results

Model:

   column_index                                              token  token_index
0            -1  {"selectedCols":"[\"f0\"]","selectedColTypes":...          NaN
1             0                                             tennis          0.0
2             0                                         basketball          1.0
3             0                                           football          2.0
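
The model rows can be read as a per-column lookup table. The sketch below shows, in plain Python, how such rows might be turned into a dictionary and used to encode new values. It is not how Alink applies the model. It assumes that `column_index` is the position of the column in `selectedCols` and that the row with `column_index` -1 carries only metadata. `rows_to_lookup` is a hypothetical helper, and the metadata token is a placeholder rather than the real JSON.

```python
# Rows copied from the printed model: (column_index, token, token_index).
# The metadata token is a placeholder; the real value is truncated in the output.
rows = [
    (-1, "<metadata JSON>", None),
    (0, "tennis", 0.0),
    (0, "basketball", 1.0),
    (0, "football", 2.0),
]


def rows_to_lookup(model_rows):
    """Build {column_index: {token: index}} from model rows, skipping metadata."""
    lookup = {}
    for column_index, token, token_index in model_rows:
        if column_index < 0:
            continue  # assumption: negative column_index marks the metadata row
        lookup.setdefault(column_index, {})[token] = int(token_index)
    return lookup


lookup = rows_to_lookup(rows)
# Encode new values of column 0 (f0) with the learned indices.
encoded = [lookup[0][s] for s in ["football", "tennis", "basketball"]]
print(encoded)
```

The output matches the "frequency_asc" ordering requested in the script: "tennis" occurs once and gets index 0, while "football" occurs three times and gets index 2.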