6.1. Pipelines and composite estimators

Transformers are usually combined with classifiers, regressors or other estimators to build a composite estimator. The most common tool is a Pipeline. Pipeline is often used in combination with FeatureUnion, which concatenates the output of transformers into a composite feature space. TransformedTargetRegressor deals with transforming the target (e.g. log-transforming y). In contrast, Pipelines only transform the observed data (X).

6.1.1. Pipeline: chaining estimators

Pipeline can be used to chain multiple estimators into one. This is useful as there is often a fixed sequence of steps in processing the data, for example feature selection, normalization and classification. Pipeline serves multiple purposes here:

  • Convenience and encapsulation: You only have to call fit and predict once on your data to fit a whole sequence of estimators.

  • Joint parameter selection: You can grid search over parameters of all estimators in the pipeline at once.

  • Safety: Pipelines help avoid leaking statistics from your test data into the trained model in cross-validation, by ensuring that the same samples are used to train the transformers and predictors.

All estimators in a pipeline, except the last one, must be transformers (i.e. must have a transform method). The last estimator may be any type (transformer, classifier, etc.).
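The three purposes listed above can be illustrated with a minimal sketch. The dataset (iris), the step names and the choice of StandardScaler followed by SVC are assumptions made for illustration only:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# One object holds the whole sequence: scaling, then classification.
pipe = Pipeline([("scale", StandardScaler()), ("clf", SVC())])

# A single fit/predict call runs every step in order.
pipe.fit(X, y)
predictions = pipe.predict(X)

# During cross-validation the whole pipeline is refit on each training
# fold, so the scaler never sees statistics from the held-out fold.
scores = cross_val_score(pipe, X, y, cv=5)
```

Scaling X once up front and then cross-validating only the SVC would let the scaler see the held-out samples, which is the leakage the pipeline avoids.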

6.1.1.1. Usage

6.1.1.1.1. Construction

The Pipeline is built using a list of (key, value) pairs, where the key is a string containing the name you want to give this step and value is an estimator object:

>>> from sklearn.pipeline import Pipeline
>>> from sklearn.svm import SVC
>>> from sklearn.decomposition import PCA
>>> estimators = [('reduce_dim', PCA()), ('clf', SVC())]
>>> pipe = Pipeline(estimators)
>>> pipe
Pipeline(steps=[('reduce_dim', PCA()), ('clf', SVC())])

The utility function make_pipeline is a shorthand for constructing pipelines; it takes a variable number of estimators and returns a pipeline, filling in the names automatically:

>>> from sklearn.pipeline import make_pipeline
>>> from sklearn.naive_bayes import MultinomialNB
>>> from sklearn.preprocessing import Binarizer
>>> make_pipeline(Binarizer(), MultinomialNB())
Pipeline(steps=[('binarizer', Binarizer()), ('multinomialnb', MultinomialNB())])

6.1.1.1.2. Accessing steps

The estimators of a pipeline are stored as a list in the steps attribute, but can be accessed by index or name by indexing (with [idx]) the Pipeline:

>>> pipe.steps[0]
('reduce_dim', PCA())
>>> pipe[0]
PCA()
>>> pipe['reduce_dim']
PCA()

Pipeline’s named_steps attribute allows accessing steps by name with tab completion in interactive environments:

>>> pipe.named_steps.reduce_dim is pipe['reduce_dim']
True

A sub-pipeline can also be extracted using the slicing notation commonly used for Python Sequences such as lists or strings (although only a step of 1 is permitted). This is convenient for performing only some of the transformations (or their inverse):

>>> pipe[:1]
Pipeline(steps=[('reduce_dim', PCA())])
>>> pipe[-1:]
Pipeline(steps=[('clf', SVC())])
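As a sketch of how slicing helps with partial transformations, the snippet below fits a pipeline and then uses the slice pipe[:-1] to apply only the transformers, and afterwards their inverse. The iris dataset and n_components=2 are assumptions chosen to keep the example small:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
pipe = Pipeline([("reduce_dim", PCA(n_components=2)), ("clf", SVC())])
pipe.fit(X, y)

# pipe[:-1] is a Pipeline holding every step except the final classifier.
# The slice refers to the same, already fitted PCA instance.
X_reduced = pipe[:-1].transform(X)

# Map the reduced data back to the original feature space.
X_restored = pipe[:-1].inverse_transform(X_reduced)
```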

6.1.1.1.3. Nested parameters

Parameters of the estimators in the pipeline can be accessed using the <estimator>__<parameter> syntax:

>>> pipe.set_params(clf__C=10)
Pipeline(steps=[('reduce_dim', PCA()), ('clf', SVC(C=10))])

This is particularly important for doing grid searches:

>>> from sklearn.model_selection import GridSearchCV
>>> param_grid = dict(reduce_dim__n_components=[2, 5, 10],
...                   clf__C=[0.1, 10, 100])
>>> grid_search = GridSearchCV(pipe, param_grid=param_grid)
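To see such a grid search run end to end, here is a minimal sketch on the iris dataset. The dataset, the reduced grid values and cv=3 are assumptions for illustration (iris has only four features, so n_components is limited to small values):

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
pipe = Pipeline([("reduce_dim", PCA()), ("clf", SVC())])

# Keys follow the <estimator>__<parameter> convention.
param_grid = {
    "reduce_dim__n_components": [2, 3],
    "clf__C": [0.1, 10, 100],
}
grid_search = GridSearchCV(pipe, param_grid=param_grid, cv=3)
grid_search.fit(X, y)

# The best setting covers parameters of both steps at once.
best = grid_search.best_params_
n_candidates = len(grid_search.cv_results_["params"])
```

After fitting, grid_search.best_estimator_ is itself a fitted Pipeline that can be used for prediction.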

Individual steps may also be replaced as parameters, and non-final steps may be ignored by setting them to 'passthrough':

>>> from sklearn.linear_model import LogisticRegression
>>> param_grid = dict(reduce_dim=['passthrough', PCA(5), PCA(10)],
...                   clf=[SVC(), LogisticRegression()],
...                   clf__C=[0.1, 10, 100])
>>> grid_search = GridSearchCV(pipe, param_grid=param_grid)
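The effect of 'passthrough' can be checked outside a grid search as well. The sketch below assumes the iris dataset and relies on the n_features_in_ attribute that recent scikit-learn releases set on fitted estimators:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)  # 4 input features
pipe = Pipeline([("reduce_dim", PCA(n_components=2)), ("clf", SVC())])

pipe.fit(X, y)
n_features_with_pca = pipe["clf"].n_features_in_

# Replace the PCA step by 'passthrough': the data reaches the SVC unchanged.
pipe.set_params(reduce_dim="passthrough")
pipe.fit(X, y)
n_features_without_pca = pipe["clf"].n_features_in_
```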

The estimators of the pipeline can be retrieved by index:

>>> pipe[0]
PCA()

or by name:

>>> pipe['reduce_dim']
PCA()

6.1.1.2. Notes

Calling fit on the pipeline is the same as calling fit on each estimator in turn, transforming the input and passing it on to the next step. The pipeline has all the methods that the last estimator in the pipeline has, i.e. if the last estimator is a classifier, the Pipeline can be used as a classifier. If the last estimator is a transformer, again, so is the pipeline.
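The equivalence described here can be sketched by writing the steps out by hand and comparing the predictions. The dataset and estimator settings are assumptions for illustration:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Pipeline version: one call fits both steps.
pipe = Pipeline([("reduce_dim", PCA(n_components=2)), ("clf", SVC())])
pipe.fit(X, y)

# Manual version: fit and transform with the first step,
# then fit the final estimator on the transformed data.
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
clf = SVC().fit(X_reduced, y)

# Prediction follows the same path: transform, then predict.
manual_pred = clf.predict(pca.transform(X))
same = np.array_equal(pipe.predict(X), manual_pred)
```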

6.1.1.3. Caching transformers: avoid repeated computation

Fitting transformers may be computationally expensive. With its memory parameter set, Pipeline will cache each transformer after calling fit. This avoids refitting the transformers within a pipeline when the parameters and input data are identical. A typical example is a grid search in which the transformers can be fitted only once and reused for each configuration.

The parameter memory is needed in order to cache the transformers. memory can be either a string containing the directory where to cache the transformers or a joblib.Memory object:

>>> from tempfile import mkdtemp
>>> from shutil import rmtree
>>> from sklearn.decomposition import PCA
>>> from sklearn.svm import SVC
>>> from sklearn.pipeline import Pipeline
>>> estimators = [('reduce_dim', PCA()), ('clf', SVC())]
>>> cachedir = mkdtemp()
>>> pipe = Pipeline(estimators, memory=cachedir)
>>> pipe
Pipeline(memory=...,
         steps=[('reduce_dim', PCA()), ('clf', SVC())])
>>> # Clear the cache directory when you don't need it anymore
>>> rmtree(cachedir)
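The grid search case mentioned above might look like the following sketch. Only the classifier parameter varies, so for a given training fold the PCA fit should be served from the cache after the first candidate. The digits dataset, n_components=10 and the grid values are assumptions for illustration:

```python
from shutil import rmtree
from tempfile import mkdtemp

from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)
cachedir = mkdtemp()
try:
    pipe = Pipeline(
        [("reduce_dim", PCA(n_components=10)), ("clf", SVC())],
        memory=cachedir,
    )
    # The PCA parameters are identical across candidates, so its fit
    # on each training fold can be reused instead of recomputed.
    grid_search = GridSearchCV(pipe, param_grid={"clf__C": [0.1, 1, 10]}, cv=3)
    grid_search.fit(X, y)
    best_C = grid_search.best_params_["clf__C"]
    best_score = grid_search.best_score_
finally:
    # Clear the cache directory once it is no longer needed.
    rmtree(cachedir)
```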

Warning

Side effect of caching transformers

Using a Pipeline without cache enabled, it is possible to inspect the original instance such as:

>>> from sklearn.datasets import load_digits
>>> X_digits, y_digits = load_digits(return_X_y=True)
>>> pca1 = PCA()
>>> svm1 = SVC()
>>> pipe = Pipeline([('reduce_dim', pca1), ('clf', svm1)])
>>> pipe.fit(X_digits, y_digits)
Pipeline(steps=[('reduce_dim', PCA()), ('clf', SVC())])
>>> # The pca instance can be inspected directly
>>> print(pca1.components_)
[[-1.77484909e-19 ... 4.07058917e-18]]

Enabling caching triggers a clone of the transformers before fitting. Therefore, the transformer instance given to the pipeline cannot be inspected directly. In the following example, accessing the PCA instance pca2 will raise an AttributeError since pca2 will be an unfitted transformer. Instead, use the attribute named_steps to inspect estimators within the pipeline:

>>> cachedir = mkdtemp()
>>> pca2 = PCA()
>>> svm2 = SVC()
>>> cached_pipe = Pipeline([('reduce_dim', pca2), ('clf', svm2)],
...                        memory=cachedir)
>>> cached_pipe.fit(X_digits, y_digits)
Pipeline(memory=...,
         steps=[('reduce_dim', PCA()), ('clf', SVC())])
>>> print(cached_pipe.named_steps['reduce_dim'].components_)
[[-1.77484909e-19 ... 4.07058917e-18]]
>>> # Remove the cache directory
>>> rmtree(cachedir)
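The cloning side effect can be made explicit with a few checks. This is a minimal sketch that assumes the iris dataset and uses hasattr as a simple test for whether an instance has been fitted:

```python
from shutil import rmtree
from tempfile import mkdtemp

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
cachedir = mkdtemp()
try:
    pca = PCA(n_components=2)
    cached_pipe = Pipeline([("reduce_dim", pca), ("clf", SVC())], memory=cachedir)
    cached_pipe.fit(X, y)

    # The instance passed in was cloned before fitting, so it stays unfitted.
    original_is_fitted = hasattr(pca, "components_")

    # The fitted clone lives inside the pipeline.
    fitted_pca = cached_pipe.named_steps["reduce_dim"]
    step_is_fitted = hasattr(fitted_pca, "components_")
    is_same_object = fitted_pca is pca
finally:
    rmtree(cachedir)
```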

6.1.2. Transforming target in regression

TransformedTargetRegressor transforms the targets y before fitting a regression model. The predictions are mapped back to the original space via an inverse transform. It takes as an argument the regressor that will be used for prediction, and the transformer that will be applied to the target variable:

>>> import numpy as np
>>> from sklearn.datasets import load_boston
>>> from sklearn.compose import TransformedTargetRegressor
>>> from sklearn.preprocessing import QuantileTransformer
>>> from sklearn.linear_model import LinearRegression
>>> from sklearn.model_selection import train_test_split
>>> X, y = load_boston(return_X_y=True)
>>> transformer = QuantileTransformer(output_distribution='normal')
>>> regressor = LinearRegression()
>>> regr = TransformedTargetRegressor(regressor=regressor,
...                                   transformer=transformer)
>>> X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
>>> regr.fit(X_train, y_train)
TransformedTargetRegressor(...)
>>> print('R2 score: {0:.2f}'.format(regr.score(X_test, y_test)))
R2 score: 0.67
>>> raw_target_regr = LinearRegression().fit(X_train, y_train)
>>> print('R2 score: {0:.2f}'.format(raw_target_regr.score(X_test, y_test)))
R2 score: 0.64

For simple transformations, instead of a Transformer object, a pair of functions can be passed, defining the transformation and its inverse mapping:

>>> def func(x):
...     return np.log(x)
>>> def inverse_func(x):
...     return np.exp(x)

Subsequently, the object is created as:

>>> regr = TransformedTargetRegressor(regressor=regressor,
...                                   func=func,
...                                   inverse_func=inverse_func)
>>> regr.fit(X_train, y_train)
TransformedTargetRegressor(...)
>>> print('R2 score: {0:.2f}'.format(regr.score(X_test, y_test)))
R2 score: 0.65
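To make the mechanics concrete, the sketch below compares TransformedTargetRegressor with the same steps done by hand: fit on log(y), then exponentiate the predictions. The synthetic data from make_regression and the way the target is made positive are assumptions for illustration:

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression

X, y = make_regression(n_samples=200, n_features=5, noise=10, random_state=0)
y = np.exp(y / y.std())  # strictly positive target, so the log is defined

# Wrapped version: the target is transformed and predictions are mapped back.
regr = TransformedTargetRegressor(
    regressor=LinearRegression(), func=np.log, inverse_func=np.exp
)
regr.fit(X, y)
wrapped_pred = regr.predict(X)

# Manual version of the same steps.
manual = LinearRegression().fit(X, np.log(y))
manual_pred = np.exp(manual.predict(X))

same = np.allclose(wrapped_pred, manual_pred)
```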

By default, the provided functions are checked at each fit to be the inverse of each other. However, it is possible to bypass this checking by setting check_inverse to False:

>>> def inverse_func(x):
...     return x
>>> regr = TransformedTargetRegressor(regressor=regressor,
...                                   func=func,
...                                   inverse_func=inverse_func,
...                                   check_inverse=False)
>>> regr.fit(X_train, y_train)
TransformedTargetRegressor(...)
>>> print('R2 score: {0:.2f}'.format(regr.score(X_test, y_test)))
R2 score: -4.50

Note

The transformation can be triggered by setting either transformer or the pair of functions func and inverse_func. However, setting both options will raise an error.
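A short sketch of the conflicting configuration follows. The tiny dataset is made up for illustration, and the sketch assumes the error is a ValueError raised when fit is called:

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

X = np.arange(1.0, 11.0).reshape(-1, 1)  # hypothetical toy data
y = np.arange(1.0, 11.0)

# Both a transformer and a func/inverse_func pair are given.
regr = TransformedTargetRegressor(
    regressor=LinearRegression(),
    transformer=StandardScaler(),
    func=np.log,
    inverse_func=np.exp,
)

raised = False
try:
    regr.fit(X, y)
except ValueError:
    # The two ways of specifying the transformation are mutually exclusive.
    raised = True
```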

6.1.3. FeatureUnion: composite feature spaces

FeatureUnion combines several transformer objects into a new transformer that combines their output. A FeatureUnion takes a list of transformer objects. During fitting, each of these is fit to the data independently. The transformers are applied in parallel, and the feature matrices they output are concatenated side-by-side into a larger matrix.

When you want to apply different transformations to each field of the data, see the related class sklearn.compose.ColumnTransformer (see user guide).

FeatureUnion serves the same purposes as Pipeline: convenience and joint parameter estimation and validation.

FeatureUnion and Pipeline can be combined to create complex models.

(A FeatureUnion has no way of checking whether two transformers might produce identical features. It only produces a union when the feature sets are disjoint, and making sure they are is the caller’s responsibility.)

6.1.3.1. Usage

A FeatureUnion is built using a list of (key, value) pairs, where the key is the name you want to give to a given transformation (an arbitrary string; it only serves as an identifier) and value is an estimator object:

>>> from sklearn.pipeline import FeatureUnion
>>> from sklearn.decomposition import PCA
>>> from sklearn.decomposition import KernelPCA
>>> estimators = [('linear_pca', PCA()), ('kernel_pca', KernelPCA())]
>>> combined = FeatureUnion(estimators)
>>> combined
FeatureUnion(transformer_list=[('linear_pca', PCA()),
                               ('kernel_pca', KernelPCA())])
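The side-by-side concatenation can be checked on a small example. In this sketch the iris dataset, the step names and the pairing of PCA with SelectKBest are assumptions for illustration:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import FeatureUnion

X, y = load_iris(return_X_y=True)

# Two PCA components plus the single best original feature.
combined = FeatureUnion(
    [("pca", PCA(n_components=2)), ("univ_select", SelectKBest(k=1))]
)
X_features = combined.fit_transform(X, y)

# The first two columns are the output of the fitted PCA step,
# the last column comes from SelectKBest.
pca_part = combined.transformer_list[0][1].transform(X)
first_two_match = np.allclose(X_features[:, :2], pca_part)
```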

Like pipelines, feature unions have a shorthand constructor called make_union that does not require explicit naming of the components.
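A minimal sketch of make_union follows; the choice of PCA and TruncatedSVD is only for illustration. As with make_pipeline, the names are derived from the lowercased class names:

```python
from sklearn.decomposition import PCA, TruncatedSVD
from sklearn.pipeline import make_union

# No names are passed; they are generated automatically.
union = make_union(PCA(), TruncatedSVD())
names = [name for name, _ in union.transformer_list]
# names == ['pca', 'truncatedsvd']
```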

Like Pipeline, individual steps may be replaced using set_params, and ignored by setting to 'drop':

>>> combined.set_params(kernel_pca='drop')
FeatureUnion(transformer_list=[('linear_pca', PCA()),
                               ('kernel_pca', 'drop')])

6.1.4. ColumnTransformer for heterogeneous data

Warning

The compose.ColumnTransformer class is experimental and the API is subject to change.

Many datasets contain features of different types, say text, floats, and dates, where each type of feature requires separate preprocessing or feature extraction steps. Often it is easiest to preprocess data before applying scikit-learn methods, for example using pandas. Processing your data before passing it to scikit-learn might be problematic for one of the following reasons:

  • Incorporating statistics from test data into the preprocessors makes cross-validation scores unreliable (known as data leakage), for example in the case of scalers or imputing missing values.

  • You may want to include the parameters of the preprocessors in a parameter search.

The ColumnTransformer helps perform different transformations for different columns of the data, within a Pipeline that is safe from data leakage and that can be parametrized. ColumnTransformer works on arrays, sparse matrices, and pandas DataFrames.

To each column, a different transformation can be applied, such as preprocessing or a specific feature extraction method:

>>> import pandas as pd
>>> X = pd.DataFrame(
...     {'city': ['London', 'London', 'Paris', 'Sallisaw'],
...      'title': ["His Last Bow", "How Watson Learned the Trick",
...                "A Moveable Feast", "The Grapes of Wrath"],
...      'expert_rating': [5, 3, 4, 5],
...      'user_rating': [4, 5, 4, 3]})

For this data, we might want to encode the 'city' column as a categorical variable using preprocessing.OneHotEncoder but apply a feature_extraction.text.CountVectorizer to the 'title' column. As we might use multiple feature extraction methods on the same column, we give each transformer a unique name, say 'city_category' and 'title_bow'. By default, the remaining rating columns are ignored (remainder='drop'):

>>> from sklearn.compose import ColumnTransformer
>>> from sklearn.feature_extraction.text import CountVectorizer
>>> from sklearn.preprocessing import OneHotEncoder
>>> column_trans = ColumnTransformer(
...     [('city_category', OneHotEncoder(dtype='int'), ['city']),
...      ('title_bow', CountVectorizer(), 'title')],
...     remainder='drop')

>>> column_trans.fit(X)
ColumnTransformer(transformers=[('city_category', OneHotEncoder(dtype='int'),
                                 ['city']),
                                ('title_bow', CountVectorizer(), 'title')])

>>> column_trans.get_feature_names()
['city_category__x0_London', 'city_category__x0_Paris', 'city_category__x0_Sallisaw',
'title_bow__bow', 'title_bow__feast', 'title_bow__grapes', 'title_bow__his',
'title_bow__how', 'title_bow__last', 'title_bow__learned', 'title_bow__moveable',
'title_bow__of', 'title_bow__the', 'title_bow__trick', 'title_bow__watson',
'title_bow__wrath']

>>> column_trans.transform(X).toarray()
array([[1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0],
       [1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0],
       [0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
       [0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1]]...)

In the above example, the CountVectorizer expects a 1D array as input and therefore the column was specified as a string ('title'). However, preprocessing.OneHotEncoder, like most other transformers, expects 2D data, so in that case you need to specify the column as a list of strings (['city']).

Apart from a scalar or a single item list, the column selection can be specified as a list of multiple items, an integer array, a slice, a boolean mask, or with a make_column_selector. The make_column_selector is used to select columns based on data type or column name:

>>> from sklearn.preprocessing import StandardScaler
>>> from sklearn.compose import make_column_selector
>>> ct = ColumnTransformer([
...     ('scale', StandardScaler(),
...      make_column_selector(dtype_include=np.number)),
...     ('onehot',
...      OneHotEncoder(),
...      make_column_selector(pattern='city', dtype_include=object))])
>>> ct.fit_transform(X)
array([[ 0.904...,  0.      ,  1. ,  0. ,  0. ],
       [-1.507...,  1.414...,  1. ,  0. ,  0. ],
       [-0.301...,  0.      ,  0. ,  1. ,  0. ],
       [ 0.904..., -1.414...,  0. ,  0. ,  1. ]])

Strings can reference columns if the input is a DataFrame; integers are always interpreted as positional columns.
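Positional selection also works on plain NumPy arrays, where no column names exist. The following sketch uses a made-up 3x3 array and combines a list of integers with a slice:

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical numeric data without column names.
X = np.array([[0.0, 10.0, 100.0],
              [1.0, 20.0, 200.0],
              [2.0, 30.0, 300.0]])

ct = ColumnTransformer([
    ("standardize", StandardScaler(), [0]),   # first column, by position
    ("minmax", MinMaxScaler(), slice(1, 3)),  # second and third columns
])
Xt = ct.fit_transform(X)
```

The output columns follow the order of the transformers in the list: the standardized first column comes first, followed by the two min-max scaled columns.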

We can keep the remaining rating columns by setting remainder='passthrough'. The values are appended to the end of the transformation:

>>> column_trans = ColumnTransformer(
...     [('city_category', OneHotEncoder(dtype='int'), ['city']),
...      ('title_bow', CountVectorizer(), 'title')],
...     remainder='passthrough')

>>> column_trans.fit_transform(X)
array([[1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 5, 4],
       [1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 3, 5],
       [0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 4, 4],
       [0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 5, 3]]...)

The remainder parameter can be set to an estimator to transform the remaining rating columns. The transformed values are appended to the end of the transformation:

>>> from sklearn.preprocessing import MinMaxScaler
>>> column_trans = ColumnTransformer(
...     [('city_category', OneHotEncoder(), ['city']),
...      ('title_bow', CountVectorizer(), 'title')],
...     remainder=MinMaxScaler())

>>> column_trans.fit_transform(X)[:, -2:]
array([[1. , 0.5],
       [0. , 1. ],
       [0.5, 0.5],
       [1. , 0. ]])

The make_column_transformer function is available to more easily create a ColumnTransformer object. Specifically, the names will be given automatically. The equivalent for the above example would be:

>>> from sklearn.compose import make_column_transformer
>>> column_trans = make_column_transformer(
...     (OneHotEncoder(), ['city']),
...     (CountVectorizer(), 'title'),
...     remainder=MinMaxScaler())
>>> column_trans
ColumnTransformer(remainder=MinMaxScaler(),
                  transformers=[('onehotencoder', OneHotEncoder(), ['city']),
                                ('countvectorizer', CountVectorizer(),
                                 'title')])
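Since a ColumnTransformer is itself a transformer, it can serve as the first step of a Pipeline so that preprocessing and the final model are fitted together. The sketch below reuses the small DataFrame from this section; the labels y, the LogisticRegression classifier and handle_unknown='ignore' are assumptions added for illustration:

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

X = pd.DataFrame(
    {"city": ["London", "London", "Paris", "Sallisaw"],
     "title": ["His Last Bow", "How Watson Learned the Trick",
               "A Moveable Feast", "The Grapes of Wrath"],
     "expert_rating": [5, 3, 4, 5],
     "user_rating": [4, 5, 4, 3]})
y = [1, 0, 1, 1]  # hypothetical labels, not part of the original data

preprocess = make_column_transformer(
    (OneHotEncoder(handle_unknown="ignore"), ["city"]),
    (CountVectorizer(), "title"),
    remainder=MinMaxScaler(),
)

# Preprocessing and classifier are fitted in a single call.
model = make_pipeline(preprocess, LogisticRegression())
model.fit(X, y)
pred = model.predict(X)
step_names = list(model.named_steps)
```

Parameters of the nested transformers remain reachable through the usual double-underscore syntax, for example columntransformer__countvectorizer__lowercase.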
