Description

If gaps is true, the operator splits the document using the given pattern as a delimiter. If gaps is false, it extracts the tokens matching the pattern.
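The two modes correspond to Python's `re.split` (pattern as delimiter) and `re.findall` (pattern as token) semantics. A minimal sketch of the distinction, using the standard `re` module rather than the operator itself:

```python
import re

doc = "That is an English Book!"

# gaps=True: the pattern is a delimiter -- split the document on it
split_tokens = re.split(r"\s+", doc)
# ['That', 'is', 'an', 'English', 'Book!']

# gaps=False: the pattern describes the tokens -- extract all matches
match_tokens = re.findall(r"\w+", doc.lower())
# ['that', 'is', 'an', 'english', 'book']
```

Note that with gaps=True the punctuation stays attached to the word ("Book!"), while with gaps=False and pattern `\w+` the punctuation is dropped.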

Parameters

| Name | Description | Type | Required? | Default Value |
| --- | --- | --- | --- | --- |
| pattern | If gaps is true, it is used as a delimiter; if gaps is false, it is used as the token pattern | String | | "\s+" |
| gaps | If true, splits the document with the given pattern; if false, extracts the tokens matching the pattern | Boolean | | true |
| minTokenLength | The minimum token length | Integer | | 1 |
| toLowerCase | If true, transforms all tokens to lower case | Boolean | | true |
| selectedCol | Name of the selected column used for processing | String | | |
| outputCol | Name of the output column | String | | null |
| reservedCols | Names of the columns to be retained in the output table | String[] | | null |
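How the parameters interact can be sketched with a small pure-Python helper. The function name `tokenize` and its keyword arguments are hypothetical, chosen only to mirror the parameter names above; this is not the operator's implementation:

```python
import re

def tokenize(doc, pattern=r"\s+", gaps=True, min_token_length=1, to_lower_case=True):
    """Hypothetical helper mimicking the operator's parameters."""
    if to_lower_case:
        doc = doc.lower()
    # gaps=True: split on the pattern; gaps=False: extract matches of the pattern
    tokens = re.split(pattern, doc) if gaps else re.findall(pattern, doc)
    # drop tokens shorter than the minimum length
    return [t for t in tokens if len(t) >= min_token_length]

tokenize("Do you like math?", pattern=r"\w+", gaps=False, min_token_length=3)
# ['you', 'like', 'math']
```

With `min_token_length=3`, the two-letter token "do" is filtered out after extraction.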

Script Example

Code

```python
from pyalink.alink import *
import numpy as np
import pandas as pd

data = np.array([
    [0, 'That is an English Book!'],
    [1, 'Do you like math?'],
    [2, 'Have a good day!']
])
df = pd.DataFrame({"id": data[:, 0], "text": data[:, 1]})
inOp1 = dataframeToOperator(df, schemaStr='id long, text string', op_type='batch')

op = RegexTokenizer() \
    .setSelectedCol("text") \
    .setGaps(False) \
    .setToLowerCase(True) \
    .setOutputCol("token") \
    .setPattern("\\w+")
op.transform(inOp1).print()
```

Results

```
   id  text                      token
0   0  That is an English Book!  that is an english book
1   2  Have a good day!          have a good day
2   1  Do you like math?         do you like math
```