Scikit-Learn Transformer Examples

Below we outline some more complex transformers, along with transformers that need additional processing to work well with Pipelines and Feature Unions.

Label Encoder

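LabelEncoder is the Scikit-Learn counterpart of Spark's StringIndexer, but the Scikit-Learn transformer has a couple of unique characteristics that we need to account for:

  1. LabelEncoder only operates on a single feature at a time.
  2. The output of LabelEncoder is a one-dimensional numpy array of n labels rather than an (n, 1) column array, which is the shape that further processing such as one-hot encoding requires.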
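
The shape issue in the second point can be seen with plain scikit-learn, without any MLeap extensions. The following is a minimal sketch, assuming only numpy and scikit-learn are installed; the variable names are illustrative. It shows that the encoded labels come back as a flat array and that a reshape turns them into the (n, 1) column that later steps expect.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Illustrative input: a single categorical feature
names = np.array(['Alice', 'Jack', 'Bob'])

encoder = LabelEncoder()
labels = encoder.fit_transform(names)  # flat, one-dimensional array of integer labels

# Classes are sorted alphabetically: Alice -> 0, Bob -> 1, Jack -> 2
print(encoder.classes_)
print(labels.shape)    # (3,)

# Reshape into an n-by-1 column so it can feed a one-hot encoder
column = labels.reshape(-1, 1)
print(column.shape)    # (3, 1)
```

In the MLeap pipeline this reshape is the job of the ReshapeArrayToN1 transformer.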

Here is an example pipeline that uses LabelEncoder.

    # Load the MLeap extensions first so the transformers gain mlinit
    # (import paths follow MLeap's scikit-learn integration; adjust to your version)
    import mleap.sklearn.pipeline
    from mleap.sklearn.preprocessing.data import FeatureExtractor, LabelEncoder, ReshapeArrayToN1

    import numpy as np
    import pandas as pd
    from sklearn.pipeline import Pipeline

    # Create a dataframe with a categorical and a continuous feature
    df = pd.DataFrame(np.array([['Alice', 32], ['Jack', 18], ['Bob', 34]]), columns=['name', 'age'])

    # Define our feature extractor
    feature_extractor_tf = FeatureExtractor(input_scalars=['name'],
                                            output_vector='name_continuous_feature',
                                            output_vector_items=['name_label_encoded'])

    # Label encoder for the name feature
    label_encoder_tf = LabelEncoder()
    label_encoder_tf.mlinit(input_features=feature_extractor_tf.output_vector_items,
                            output_features='name_label_le')

    # Reshape the output of the LabelEncoder to an N-by-1 array
    reshape_le_tf = ReshapeArrayToN1()

    # Create our pipeline object and initialize MLeap serialization
    le_pipeline = Pipeline([(feature_extractor_tf.name, feature_extractor_tf),
                            (label_encoder_tf.name, label_encoder_tf),
                            (reshape_le_tf.name, reshape_le_tf)])
    le_pipeline.mlinit()

    # Transform our DataFrame
    le_pipeline.fit_transform(df)

    # output
    # array([[0],
    #        [2],
    #        [1]])

The next step is to combine the label indexer with a OneHotEncoder.

Scikit-Learn OneHotEncoder

We continue the example above to show how Scikit-Learn's own OneHotEncoder works:

    # Continues the LabelEncoder example: reuses df, feature_extractor_tf,
    # label_encoder_tf and reshape_le_tf (the MLeap extensions add mlinit)
    from sklearn.preprocessing import OneHotEncoder

    # One-hot encoder for the label-encoded name feature
    # Make sure to set sparse=False (renamed sparse_output in scikit-learn 1.2)
    one_hot_encoder_tf = OneHotEncoder(sparse=False)
    one_hot_encoder_tf.mlinit(prior_tf=label_encoder_tf,
                              output_features='{}_one_hot_encoded'.format(label_encoder_tf.output_features))

    # Construct our pipeline
    one_hot_encoder_pipeline_x0 = Pipeline([
        (feature_extractor_tf.name, feature_extractor_tf),
        (label_encoder_tf.name, label_encoder_tf),
        (reshape_le_tf.name, reshape_le_tf),
        (one_hot_encoder_tf.name, one_hot_encoder_tf)
    ])
    one_hot_encoder_pipeline_x0.mlinit()

    # Execute our LabelEncoder + OneHotEncoder pipeline on our dataframe
    one_hot_encoder_pipeline_x0.fit_transform(df)

    # output
    # matrix([[ 1., 0., 0.],
    #         [ 0., 0., 1.],
    #         [ 0., 1., 0.]])
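
For comparison, the same encoding can be reproduced with scikit-learn alone. This is a minimal sketch, assuming scikit-learn 0.20 or later, where OneHotEncoder accepts string columns directly; it leaves out MLeap serialization entirely. Calling toarray() on the default sparse result avoids the sparse/sparse_output argument, whose name depends on the scikit-learn version.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# One categorical column, shaped (n, 1) as the encoder requires
names = np.array([['Alice'], ['Jack'], ['Bob']])

encoder = OneHotEncoder()                        # returns a sparse matrix by default
dense = encoder.fit_transform(names).toarray()   # convert to a dense numpy array

# Categories are sorted alphabetically: Alice, Bob, Jack
print(encoder.categories_[0])
print(dense)
```

The rows line up with the pipeline output above: Alice maps to the first column, Jack to the third, and Bob to the second.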

One shortcoming of Scikit-Learn's OneHotEncoder is that it lacks the drop_last functionality that ML pipelines require.

MLeap comes with its own OneHotEncoder that provides this functionality.

MLeap OneHotEncoder

It is very similar to Scikit-Learn's OneHotEncoder, except that we set an additional drop_last attribute.

    from mleap.sklearn.extensions.data import OneHotEncoder

    # One-hot encoder for the label-encoded name feature, dropping the last category
    one_hot_encoder_tf = OneHotEncoder(sparse=False, drop_last=True)  # Make sure to set sparse=False
    one_hot_encoder_tf.mlinit(prior_tf=label_encoder_tf,
                              output_features='{}_one_hot_encoded'.format(label_encoder_tf.output_features))

    # Construct our pipeline
    one_hot_encoder_pipeline_x0 = Pipeline([
        (feature_extractor_tf.name, feature_extractor_tf),
        (label_encoder_tf.name, label_encoder_tf),
        (reshape_le_tf.name, reshape_le_tf),
        (one_hot_encoder_tf.name, one_hot_encoder_tf)
    ])
    one_hot_encoder_pipeline_x0.mlinit()

    # Execute our LabelEncoder + OneHotEncoder pipeline on our dataframe
    one_hot_encoder_pipeline_x0.fit_transform(df)

    # output
    # matrix([[ 1., 0.],
    #         [ 0., 0.],
    #         [ 0., 1.]])
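
To make the effect of drop_last concrete, here is a small numpy sketch. The one_hot helper is hypothetical, written only for illustration; it is not part of MLeap or scikit-learn and is not how MLeap implements the option. It reproduces both outputs shown in this section from the label-encoded values [0, 2, 1].

```python
import numpy as np

def one_hot(labels, n_classes, drop_last=False):
    """Hypothetical helper: one-hot encode integer labels, optionally dropping the last column."""
    encoded = np.eye(n_classes)[labels]   # pick one identity-matrix row per label
    if drop_last:
        encoded = encoded[:, :-1]         # the last category becomes the all-zero row
    return encoded

labels = np.array([0, 2, 1])              # Alice, Jack, Bob after label encoding

full = one_hot(labels, 3)
reduced = one_hot(labels, 3, drop_last=True)

print(full)
print(reduced)
```

With drop_last enabled, the last category (Jack, label 2) is represented by a row of zeros, so no information is lost while the redundant column is removed. Spark's OneHotEncoder behaves the same way through its dropLast parameter.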