Description

Naive Bayes Text Classifier.

Both the multinomial and the Bernoulli Naive Bayes text models are supported. Naive Bayes is a probabilistic learning method. Feature values in the training table must be nonnegative.

For details of the algorithm, see: https://nlp.stanford.edu/IR-book/html/htmledition/naive-bayes-text-classification-1.html
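As a rough illustration of the multinomial model and of what a smoothing factor does, here is a minimal NumPy sketch. It is not Alink's implementation: the function names `fit_multinomial_nb` and `predict_multinomial_nb` are hypothetical, and the data is a dense 4-feature version of the vectors used in the script example on this page.

```python
import numpy as np

def fit_multinomial_nb(X, y, smoothing=1.0):
    """Estimate log priors and smoothed log likelihoods for each class."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    if (X < 0).any():
        raise ValueError("feature values must be nonnegative")
    classes = np.unique(y)
    log_prior = np.empty(len(classes))
    log_likelihood = np.empty((len(classes), X.shape[1]))
    for k, c in enumerate(classes):
        rows = X[y == c]
        # Prior: fraction of training rows that belong to class c.
        log_prior[k] = np.log(rows.shape[0] / X.shape[0])
        # Additive smoothing: add `smoothing` to every per-feature count.
        counts = rows.sum(axis=0) + smoothing
        log_likelihood[k] = np.log(counts / counts.sum())
    return classes, log_prior, log_likelihood

def predict_multinomial_nb(model, X):
    """Return the class with the highest posterior log score for each row."""
    classes, log_prior, log_likelihood = model
    scores = np.asarray(X, dtype=float) @ log_likelihood.T + log_prior
    return classes[np.argmax(scores, axis=1)]

# Dense feature vectors and labels (assumed toy data, 4 features).
X = [[1, 1, 1, 1], [1, 1, 0, 1], [1, 0, 1, 1], [1, 0, 1, 1],
     [0, 1, 1, 0], [0, 1, 1, 0], [0, 1, 1, 0]]
y = ["1", "1", "1", "1", "0", "0", "0"]

model = fit_multinomial_nb(X, y, smoothing=1.0)
predictions = predict_multinomial_nb(model, X)
print(list(predictions))
```

With a smoothing factor of 1.0 no feature gets zero probability in any class, so a feature that never occurs with a class in training does not force that class's score to negative infinity.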

Parameters

| Name | Description | Type | Required? | Default Value |
| --- | --- | --- | --- | --- |
| modelType | Model type: Multinomial or Bernoulli | String | | "Multinomial" |
| labelCol | Name of the label column in the input table | String | | |
| weightCol | Name of the column indicating weight | String | | null |
| vectorCol | Name of the vector column | String | | |
| smoothing | Smoothing factor | Double | | 1.0 |
| predictionCol | Name of the prediction column | String | | |
| predictionDetailCol | Name of the prediction detail column, which includes detailed info | String | | |
| reservedCols | Names of the columns to be retained in the output table | String[] | | null |

Script Example

Script

    import numpy as np
    import pandas as pd
    from pyalink.alink import *

    # Each row: sparse vector string, dense vector string, class label.
    data = np.array([
        ["$31$0:1.0 1:1.0 2:1.0 30:1.0", "1.0 1.0 1.0 1.0", '1'],
        ["$31$0:1.0 1:1.0 2:0.0 30:1.0", "1.0 1.0 0.0 1.0", '1'],
        ["$31$0:1.0 1:0.0 2:1.0 30:1.0", "1.0 0.0 1.0 1.0", '1'],
        ["$31$0:1.0 1:0.0 2:1.0 30:1.0", "1.0 0.0 1.0 1.0", '1'],
        ["$31$0:0.0 1:1.0 2:1.0 30:0.0", "0.0 1.0 1.0 0.0", '0'],
        ["$31$0:0.0 1:1.0 2:1.0 30:0.0", "0.0 1.0 1.0 0.0", '0'],
        ["$31$0:0.0 1:1.0 2:1.0 30:0.0", "0.0 1.0 1.0 0.0", '0']])
    dataSchema = ["sv", "dv", "label"]
    df = pd.DataFrame(data, columns=dataSchema)
    batchData = dataframeToOperator(df, schemaStr='sv string, dv string, label string', op_type='batch')

    # Train on the sparse vector column; keep sv and label next to the prediction.
    model = NaiveBayesTextClassifier().setVectorCol("sv").setLabelCol("label").setReservedCols(["sv", "label"]).setPredictionCol("pred")
    model.fit(batchData).transform(batchData).print()

Result

| sv | label | pred |
| --- | --- | --- |
| "$31$0:1.0 1:1.0 2:1.0 30:1.0" | 1 | 1 |
| "$31$0:1.0 1:1.0 2:0.0 30:1.0" | 1 | 1 |
| "$31$0:1.0 1:0.0 2:1.0 30:1.0" | 1 | 1 |
| "$31$0:1.0 1:0.0 2:1.0 30:1.0" | 1 | 1 |
| "$31$0:0.0 1:1.0 2:1.0 30:0.0" | 0 | 0 |
| "$31$0:0.0 1:1.0 2:1.0 30:0.0" | 0 | 0 |
| "$31$0:0.0 1:1.0 2:1.0 30:0.0" | 0 | 0 |
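The strings in the sv column appear to follow a sparse vector layout of the form `$size$index:value index:value ...`, where the leading number is the vector dimension. The sketch below expands such a string into a dense array for inspection. The layout is an assumption inferred from the example data, and `parse_sparse_vector` is a hypothetical helper, not part of PyAlink.

```python
import numpy as np

def parse_sparse_vector(text):
    """Expand '$size$i:v i:v ...' into a dense array (assumed layout)."""
    # Splitting on the first two '$' gives: '', the size, and the index:value pairs.
    _, size, body = text.split("$", 2)
    dense = np.zeros(int(size))
    for pair in body.split():
        index, value = pair.split(":")
        dense[int(index)] = float(value)
    return dense

vec = parse_sparse_vector("$31$0:1.0 1:1.0 2:0.0 30:1.0")
print(vec.shape, vec[0], vec[2], vec[30])
```

Under this reading, the sv and dv columns in the example carry the same four values, but sv places the last one at index 30 of a 31-dimensional vector.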