

| Name | Description | Type | Required? | Default Value |
| --- | --- | --- | --- | --- |
| selectorType | Selection method to use: numTopFeatures, percentile, fpr, fdr, or fwe. | String | | "numTopFeatures" |
| numTopFeatures | Number of features the selector will select, ordered by ascending p-value. If the number of features is less than numTopFeatures, all features are selected. | Integer | | 50 |
| percentile | Fraction of features the selector will select, ordered by ascending p-value. Must be in the range (0,1). | Double | | 0.1 |
| fpr | The highest p-value for features to be kept. Must be in the range (0,1). | Double | | 0.05 |
| fdr | The upper bound of the expected false discovery rate. Must be in the range (0,1). | Double | | 0.05 |
| fwe | The upper bound of the expected family-wise error rate. Must be in the range (0,1). | Double | | 0.05 |
| selectedCols | Names of the columns used for processing. | String[] | | |
| labelCol | Name of the label column in the input table. | String | | |
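
The p-values referred to in the parameter descriptions come from a chi-squared test of each selected column against the label column. As a rough illustration only, the sketch below computes Pearson's chi-squared statistic and its degrees of freedom for one categorical feature. The function name `chi_squared_statistic` and the sample data are hypothetical, and this is not taken from the operator's own implementation.

```python
from collections import Counter


def chi_squared_statistic(feature, label):
    """Pearson chi-squared statistic and degrees of freedom for two
    categorical columns of equal length (hypothetical helper)."""
    n = len(feature)
    observed = Counter(zip(feature, label))   # counts per (feature, label) cell
    feature_totals = Counter(feature)         # row totals
    label_totals = Counter(label)             # column totals

    statistic = 0.0
    for f_value, f_count in feature_totals.items():
        for l_value, l_count in label_totals.items():
            # Count expected in this cell if feature and label were independent.
            expected = f_count * l_count / n
            statistic += (observed[(f_value, l_value)] - expected) ** 2 / expected

    dof = (len(feature_totals) - 1) * (len(label_totals) - 1)
    return statistic, dof


# A feature that fully determines the label, and one unrelated to it.
dependent = chi_squared_statistic(["x", "x", "y", "y"], [1, 1, 0, 0])
independent = chi_squared_statistic(["x", "y", "x", "y"], [1, 1, 0, 0])
print(dependent, independent)
```

A larger statistic at the same degrees of freedom corresponds to a smaller p-value. Converting the statistic to a p-value needs the chi-squared survival function, for example `scipy.stats.chi2.sf(statistic, dof)`.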

Options for selectorType:

  • numTopFeatures chooses a fixed number of top features according to a chi-squared test.
  • percentile is similar but chooses a fraction of all features instead of a fixed number.
  • fpr chooses all features whose p-values are below a threshold, thus controlling the false positive rate of selection.
  • fdr uses the Benjamini-Hochberg procedure to choose all features whose false discovery rate is below a threshold.
  • fwe chooses all features whose p-values are below a threshold. The threshold is scaled by 1/numFeatures, thus controlling the family-wise error rate of selection.

By default, the selection method is numTopFeatures, with the default number of top features set to 50.
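
The five selection rules above can be sketched in plain Python, given one p-value per feature. This is a minimal illustration under stated assumptions, not the operator's implementation: the function `select_features` and its argument names are hypothetical, the percentile count is assumed to be truncated to an integer, and the fpr and fwe comparisons are assumed to be strict.

```python
def select_features(p_values, selector_type="numTopFeatures",
                    num_top_features=50, percentile=0.1,
                    fpr=0.05, fdr=0.05, fwe=0.05):
    """Return the sorted indices of the features kept by the given rule
    (hypothetical helper, one p-value per feature)."""
    n = len(p_values)
    # Feature indices ordered by ascending p-value.
    order = sorted(range(n), key=lambda i: p_values[i])

    if selector_type == "numTopFeatures":
        chosen = order[:num_top_features]       # all features if n is smaller
    elif selector_type == "percentile":
        chosen = order[:int(n * percentile)]    # assumption: truncate the count
    elif selector_type == "fpr":
        chosen = [i for i in order if p_values[i] < fpr]
    elif selector_type == "fdr":
        # Benjamini-Hochberg: find the largest rank k whose p-value is
        # at most fdr * k / n, then keep the k smallest p-values.
        k_max = 0
        for rank, i in enumerate(order, start=1):
            if p_values[i] <= fdr * rank / n:
                k_max = rank
        chosen = order[:k_max]
    elif selector_type == "fwe":
        # Threshold scaled by 1/numFeatures.
        chosen = [i for i in order if p_values[i] < fwe / n]
    else:
        raise ValueError("unknown selector type: " + selector_type)
    return sorted(chosen)


# Made-up p-values for six features.
p = [0.001, 0.04, 0.03, 0.2, 0.012, 0.6]
print(select_features(p, "numTopFeatures", num_top_features=2))  # [0, 4]
print(select_features(p, "percentile", percentile=0.5))          # [0, 2, 4]
print(select_features(p, "fpr"))                                 # [0, 1, 2, 4]
print(select_features(p, "fdr"))                                 # [0, 4]
print(select_features(p, "fwe"))                                 # [0]
```

On the same p-values, fwe is the strictest rule and fpr the most permissive, with fdr in between.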

    Script Example


    from pyalink.alink import *
    import numpy as np
    import pandas as pd

    # Assumes a local PyAlink environment with parallelism 1.
    useLocalEnv(1)

    # np.array coerces the mixed-type rows to strings; the schema string
    # passed to dataframeToOperator declares the intended column types.
    data = np.array([
        ["a", 1, 1, 2.0, True],
        ["c", 1, 2, -3.0, True],
        ["a", 2, 2, 2.0, False],
        ["c", 0, 0, 0.0, False]
    ])
    df = pd.DataFrame({"f_string": data[:, 0], "f_long": data[:, 1], "f_int": data[:, 2], "f_double": data[:, 3], "f_boolean": data[:, 4]})
    source = dataframeToOperator(df, schemaStr='f_string string, f_long long, f_int int, f_double double, f_boolean boolean', op_type="batch")

    # Keep the 2 top features, using f_boolean as the label.
    selector = ChiSqSelectorBatchOp()\
        .setSelectedCols(["f_string", "f_long", "f_int", "f_double"])\
        .setLabelCol("f_boolean")\
        .setNumTopFeatures(2)
    selector.linkFrom(source)
    selectedColNames = selector.collectResult()
    print(selectedColNames)


    Script Output

    ['f_string', 'f_long']