Scikit Transformers Examples

Here we outline some of the more complicated transformers, as well as transformers that require additional processing to work nicely with Pipelines and Feature Unions.

Label Encoder

The LabelEncoder is the equivalent of the StringIndexer in Spark; however, there are a couple of quirks of the scikit-learn transformer that we need to account for:

  1. LabelEncoder only operates on a single feature at a time
  2. The output of the LabelEncoder is a flat numpy array of shape (n,) rather than the (n, 1) column vector required for further processing such as one-hot encoding (see the sketch below)
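
To make the second point concrete, here is a minimal sketch using plain scikit-learn and numpy (independent of MLeap); the reshape at the end is the same N-by-1 reshape that the ReshapeArrayToN1 transformer performs in the pipelines below.

    import numpy as np
    from sklearn.preprocessing import LabelEncoder

    names = ['Alice', 'Jack', 'Bob']
    le = LabelEncoder()
    encoded = le.fit_transform(names)   # flat array of shape (3,): [0, 2, 1]

    # Downstream transformers such as a one-hot encoder expect a column
    # vector, so the output has to be reshaped to (n, 1)
    column = encoded.reshape(-1, 1)     # shape (3, 1)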

Here is what an example Pipeline looks like for a LabelEncoder:

    # Imports (module paths assumed for the mleap-sklearn extensions -- importing
    # the mleap.sklearn modules is what attaches mlinit() to the scikit-learn classes)
    import pandas as pd
    import numpy as np
    import mleap.sklearn.preprocessing.data
    import mleap.sklearn.pipeline
    from mleap.sklearn.preprocessing.data import FeatureExtractor, ReshapeArrayToN1
    from sklearn.preprocessing import LabelEncoder
    from sklearn.pipeline import Pipeline

    # Create a dataframe with a categorical and a continuous feature
    df = pd.DataFrame(np.array([['Alice', 32], ['Jack', 18], ['Bob', 34]]), columns=['name', 'age'])

    # Define our feature extractor
    feature_extractor_tf = FeatureExtractor(input_scalars=['name'],
                                            output_vector='name_continuous_feature',
                                            output_vector_items=['name_label_encoded'])

    # Label encoder for the 'name' feature
    label_encoder_tf = LabelEncoder()
    label_encoder_tf.mlinit(input_features=feature_extractor_tf.output_vector_items,
                            output_features='name_label_le')

    # Reshape the output of the LabelEncoder to an N-by-1 array
    reshape_le_tf = ReshapeArrayToN1()

    # Create our pipeline object and initialize MLeap serialization
    le_pipeline = Pipeline([(feature_extractor_tf.name, feature_extractor_tf),
                            (label_encoder_tf.name, label_encoder_tf),
                            (reshape_le_tf.name, reshape_le_tf)])
    le_pipeline.mlinit()

    # Transform our DataFrame
    le_pipeline.fit_transform(df)
    # output
    array([[0],
           [2],
           [1]])
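
The integer indices in the output follow the classes learned by the fitted LabelEncoder, which scikit-learn stores in sorted order; you can confirm the mapping by inspecting the transformer directly:

    # The fitted classes are stored in sorted order, so the indices above map back as:
    label_encoder_tf.classes_
    # ['Alice', 'Bob', 'Jack']  ->  Alice=0, Bob=1, Jack=2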

The next step is to combine the LabelEncoder with a OneHotEncoder.

Scikit OneHotEncoder

We’ll continue the example above to demonstrate how the out-of-the-box Scikit OneHotEncoder works.

    # Out-of-the-box scikit-learn OneHotEncoder (import path assumed; mlinit is
    # attached by the mleap.sklearn imports above)
    from sklearn.preprocessing import OneHotEncoder

    # One-hot encoder for the label-encoded 'name' feature
    one_hot_encoder_tf = OneHotEncoder(sparse=False)  # Make sure to set sparse=False
    one_hot_encoder_tf.mlinit(prior_tf=label_encoder_tf,
                              output_features='{}_one_hot_encoded'.format(label_encoder_tf.output_features))

    # Construct our pipeline
    one_hot_encoder_pipeline_x0 = Pipeline([
        (feature_extractor_tf.name, feature_extractor_tf),
        (label_encoder_tf.name, label_encoder_tf),
        (reshape_le_tf.name, reshape_le_tf),
        (one_hot_encoder_tf.name, one_hot_encoder_tf)
    ])
    one_hot_encoder_pipeline_x0.mlinit()

    # Execute our LabelEncoder + OneHotEncoder pipeline on our dataframe
    one_hot_encoder_pipeline_x0.fit_transform(df)
    # output
    matrix([[ 1.,  0.,  0.],
            [ 0.,  0.,  1.],
            [ 0.,  1.,  0.]])

One of the shortcomings of Scikit's OneHotEncoder is that it is missing a drop_last option, which is commonly needed in ML pipelines (for example, to avoid perfectly collinear dummy columns). MLeap comes with its own OneHotEncoder that adds that functionality.
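
To make the effect of drop_last concrete before we get to the MLeap encoder, here is a small numpy-only sketch (not part of either API) showing that dropping the last category is simply slicing off the final dummy column:

    import numpy as np

    # Full dummy matrix produced by the Scikit OneHotEncoder example above
    full = np.array([[1., 0., 0.],
                     [0., 0., 1.],
                     [0., 1., 0.]])

    # drop_last keeps n_categories - 1 columns; the last category is implied
    # whenever all remaining columns are zero (the 'Jack' row below)
    dropped = full[:, :-1]
    # [[1., 0.],
    #  [0., 0.],
    #  [0., 1.]]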

MLeap OneHotEncoder

It is very similar to the Scikit OneHotEncoder, except that we set an additional drop_last attribute.

    # MLeap's OneHotEncoder extension, which supports drop_last
    from mleap.sklearn.extensions.data import OneHotEncoder

    # One-hot encoder for the label-encoded 'name' feature, dropping the last category
    one_hot_encoder_tf = OneHotEncoder(sparse=False, drop_last=True)  # Make sure to set sparse=False
    one_hot_encoder_tf.mlinit(prior_tf=label_encoder_tf,
                              output_features='{}_one_hot_encoded'.format(label_encoder_tf.output_features))

    # Construct our pipeline
    one_hot_encoder_pipeline_x0 = Pipeline([
        (feature_extractor_tf.name, feature_extractor_tf),
        (label_encoder_tf.name, label_encoder_tf),
        (reshape_le_tf.name, reshape_le_tf),
        (one_hot_encoder_tf.name, one_hot_encoder_tf)
    ])
    one_hot_encoder_pipeline_x0.mlinit()

    # Execute our LabelEncoder + OneHotEncoder pipeline on our dataframe
    one_hot_encoder_pipeline_x0.fit_transform(df)
    # output
    matrix([[ 1.,  0.],
            [ 0.,  0.],
            [ 0.,  1.]])
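
Once the pipeline has been fit, the usual next step is to serialize it to an MLeap bundle. A minimal sketch, assuming the serialize_to_bundle(path, model_name, init=True) method exposed by the mleap-sklearn Pipeline extension and a writable /tmp directory; the path and bundle name here are illustrative:

    # Serialize the fitted LabelEncoder + OneHotEncoder pipeline to an MLeap bundle
    one_hot_encoder_pipeline_x0.serialize_to_bundle('/tmp', 'mleap-onehot-example', init=True)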