Description

Encodes several string columns into bigint indices. The indices are consecutive bigint values starting from 0. Non-string columns are first converted to strings and then encoded. Each column is encoded separately.
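The behaviour described above can be illustrated with a minimal pure-Python sketch. It is not Alink's implementation: the helper name `encode_columns` is hypothetical, and alphabetical ordering is assumed only to make the result deterministic.

```python
def encode_columns(rows, selected_cols):
    """Encode each selected column independently into 0-based integer indices."""
    vocabularies = {}
    for col in selected_cols:
        # Non-string values are converted to strings before encoding.
        # Assumption: tokens are ordered alphabetically for this illustration.
        tokens = sorted({str(row[col]) for row in rows})
        # Consecutive indices starting from 0, one vocabulary per column.
        vocabularies[col] = {token: index for index, token in enumerate(tokens)}
    encoded = [
        {col: vocabularies[col][str(row[col])] for col in selected_cols}
        for row in rows
    ]
    return vocabularies, encoded


rows = [
    {"sport": "tennis", "year": 2020},
    {"sport": "football", "year": 2019},
    {"sport": "tennis", "year": 2019},
]
vocabularies, encoded = encode_columns(rows, ["sport", "year"])
print(vocabularies)
print(encoded)
```

Because each column has its own vocabulary, index 0 in `sport` and index 0 in `year` refer to unrelated tokens.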

Several string order types are supported:

  1. random
  2. frequency_asc
  3. frequency_desc
  4. alphabet_asc
  5. alphabet_desc
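The five order types can be sketched in plain Python as follows. This is an illustration under stated assumptions, not Alink's code: the function `build_index` and its `seed` parameter are hypothetical, and ties in frequency are broken alphabetically, which the description above does not specify.

```python
import random
from collections import Counter


def build_index(tokens, order_type="random", seed=None):
    """Map each distinct token to a 0-based index according to order_type."""
    counts = Counter(tokens)
    distinct = sorted(counts)  # alphabetical baseline; also the tie-break order
    if order_type == "random":
        random.Random(seed).shuffle(distinct)
    elif order_type == "frequency_asc":
        distinct.sort(key=lambda token: counts[token])   # rarest token first
    elif order_type == "frequency_desc":
        distinct.sort(key=lambda token: -counts[token])  # most frequent first
    elif order_type == "alphabet_asc":
        pass  # already in ascending alphabetical order
    elif order_type == "alphabet_desc":
        distinct.reverse()
    else:
        raise ValueError(f"unknown order type: {order_type}")
    return {token: index for index, token in enumerate(distinct)}


tokens = ["football"] * 3 + ["basketball"] * 2 + ["tennis"]
print(build_index(tokens, "frequency_asc"))
print(build_index(tokens, "alphabet_desc"))
```

With `frequency_asc`, the rarest token receives index 0 and the most frequent token receives the largest index.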

Parameters

| Name | Description | Type | Required? | Default Value |
| --- | --- | --- | --- | --- |
| handleInvalid | Strategy for handling unseen tokens during prediction; one of "keep", "skip" or "error" | String | | "keep" |
| selectedCols | Names of the columns used for processing | String[] | | |
| stringOrderType | String order type; one of "random", "frequency_asc", "frequency_desc", "alphabet_asc", "alphabet_desc" | String | | "random" |
| reservedCols | Names of the columns to be retained in the output table | String[] | | null |
| outputCols | Names of the output columns | String[] | | null |
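The handleInvalid parameter only names its three strategies, so the following pure-Python sketch shows one plausible reading of them. The function `lookup` is hypothetical, and the concrete results for "keep" (an extra index after the known ones) and "skip" (no index at all) are assumptions, not behaviour confirmed by this page.

```python
def lookup(token, vocabulary, handle_invalid="keep"):
    """Return the index of token, applying handle_invalid to unseen tokens."""
    if token in vocabulary:
        return vocabulary[token]
    if handle_invalid == "keep":
        # Assumption: unseen tokens share one extra index after the known ones.
        return len(vocabulary)
    if handle_invalid == "skip":
        # Assumption: unseen tokens produce no index.
        return None
    if handle_invalid == "error":
        raise KeyError(f"unseen token: {token!r}")
    raise ValueError(f"unknown handleInvalid strategy: {handle_invalid}")


vocabulary = {"tennis": 0, "basketball": 1, "football": 2}
print(lookup("football", vocabulary))
print(lookup("golf", vocabulary, "keep"))
print(lookup("golf", vocabulary, "skip"))
```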

Script Example

Code

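import numpy as np
import pandas as pd
from pyalink.alink import *

# One string column: football x3, basketball x2, tennis x1.
data = np.array([
    ["football"],
    ["football"],
    ["football"],
    ["basketball"],
    ["basketball"],
    ["tennis"],
])
df_data = pd.DataFrame({
    "f0": data[:, 0],
})
# Wrap the DataFrame as a batch operator with a single string column.
data = dataframeToOperator(df_data, schemaStr='f0 string', op_type='batch')
# Order tokens by ascending frequency, so the rarest token gets index 0.
stringindexer = MultiStringIndexer() \
    .setSelectedCols(["f0"]) \
    .setOutputCols(["f0_indexed"]) \
    .setStringOrderType("frequency_asc")
# Fit the indexer, apply it to the same data, and print the result.
stringindexer.fit(data).transform(data).print()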

Results

Output:

           f0  f0_indexed
0    football           2
1    football           2
2    football           2
3  basketball           1
4  basketball           1
5      tennis           0