Description

Transforms a document into a new document composed of all its n-grams. The document is split into an array of words by a word delimiter (a space by default). Sliding a window of n words over the word array yields all n-grams, and the words within each n-gram are joined with a "_" character. The n-grams are then joined with spaces to form the new document.
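The transformation described above can be sketched in plain Python. This is only an illustration, not the operator's actual implementation: the helper name `ngram_document` and its `delimiter` argument are hypothetical, and the handling of documents with fewer than n words (an empty result) is an assumption.

```python
def ngram_document(doc: str, n: int = 2, delimiter: str = " ") -> str:
    """Return a new document made of all n-grams of `doc` (hypothetical helper)."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    # Split the document into an array of words.
    words = doc.split(delimiter)
    # Slide a window of n words over the array; join each window with "_".
    ngrams = ["_".join(words[i:i + n]) for i in range(len(words) - n + 1)]
    # Assumption: a document with fewer than n words yields an empty string.
    return " ".join(ngrams)


print(ngram_document("That is an English Book!"))  # That_is is_an an_English English_Book!
print(ngram_document("Do you like math?", n=3))    # Do_you_like you_like_math?
```

Punctuation stays attached to its word because the split happens only on the delimiter, which is consistent with tokens such as "Book!" in the results shown later on this page.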

Parameters

Name | Description | Type | Required? | Default Value
n | N-gram length | Integer | | 2
selectedCol | Name of the selected column used for processing | String | |
outputCol | Name of the output column | String | | null
reservedCols | Names of the columns to be retained in the output table | String[] | | null

Script Example

Code

    import numpy as np
    import pandas as pd
    from pyalink.alink import *  # assumes PyAlink is installed

    data = np.array([
        [0, 'That is an English Book!'],
        [1, 'Do you like math?'],
        [2, 'Have a good day!']
    ])
    df = pd.DataFrame({"id": data[:, 0], "text": data[:, 1]})

    # Batch mode: build n-grams (default n = 2) from the "text" column.
    inOp1 = dataframeToOperator(df, schemaStr='id long, text string', op_type='batch')
    op = NGramBatchOp().setSelectedCol("text")
    print(BatchOperator.collectToDataframe(op.linkFrom(inOp1)))

    # Stream mode: the same transformation applied to a stream source.
    inOp2 = dataframeToOperator(df, schemaStr='id long, text string', op_type='stream')
    op = NGramStreamOp().setSelectedCol("text")
    op.linkFrom(inOp2).print()
    StreamOperator.execute()

Results

       id  text
    0  2   Have_a a_good good_day!
    1  1   Do_you you_like like_math?
    2  0   That_is is_an an_English English_Book!