
Beautiful Machine Learning Pipeline with Scikit-Learn

Published May 01, 2019

Feature engineering is often the most complex part of applying machine learning to your product. This note shares better practices for doing feature engineering and machine learning with scikit-learn, based on my personal experience.

Before introducing my strategies, let's review some common feature engineering problems:

  • handling missing values
  • normalization / standardization
  • feature interaction
  • label encoding
  • one hot encoding

When starting the feature engineering part of developing a machine learning model, we usually need to try many possible solutions and quickly iterate over different combinations of feature tricks. There are many articles about how to do the feature engineering work above. But when you want to apply different approaches to different features, you may end up writing complicated code: multiple numpy / scipy transformations whose outputs are fed back into scikit-learn pipelines. Such code is hard to maintain and hard to debug when a problem occurs. In this article, I want to introduce several tricks in scikit-learn for building a machine learning model pipeline that covers:

  • feature engineering on different columns
  • ensemble learning with customized transformers
  • deep learning API with complicated feature pipeline

Before we get into the details, here is how the input / output data used in the snippets below is defined:

import pandas as pd

train_data = pd.read_csv("input_data.csv")
train_labels = pd.read_csv("input_labels.csv")
predict_data = pd.read_csv("predict_data.csv")

Idea 1. Naive Feature Engineering

Let's see how to do simple feature engineering if you don't use a pipeline.

Code Example

import numpy as np
from sklearn.decomposition import NMF, PCA
from sklearn.ensemble import RandomForestClassifier

# Fit each transformer on the training data separately.
pca_transform = PCA(n_components=10)
pca_transform.fit(train_data.values)
pca_transform_data = pca_transform.transform(train_data.values)

nmf_transform = NMF()
nmf_transform.fit(train_data.values)
nmf_transform_data = nmf_transform.transform(train_data.values)

# np.hstack expects a sequence of arrays.
union_data = np.hstack([nmf_transform_data, pca_transform_data])

model = RandomForestClassifier()
model.fit(union_data, train_labels.values)

# The prediction features must be stacked in the same order as at
# training time (NMF first, then PCA).
nmf_transform_predict_data = nmf_transform.transform(predict_data.values)
pca_transform_predict_data = pca_transform.transform(predict_data.values)
union_predict_data = np.hstack(
  [nmf_transform_predict_data, pca_transform_predict_data])
predictions = model.predict(union_predict_data)

As you can see, this is a pretty intuitive implementation if you just want to apply a few feature engineering tricks to your data. But you can imagine how the code will grow into a messy monster once you apply many tricks to many different features.

Pros

  • Simple, straightforward implementation

Cons

  • Need to care about many details of the numpy / scipy interface
  • Contains a lot of duplicated code doing similar things

Idea 2. Scikit Learn Model Pipeline

To make the whole process cleaner, scikit-learn provides a Pipeline API so that users can create a machine learning pipeline without caring about the details.

Code Example

from sklearn.pipeline import Pipeline

model_pipeline = Pipeline(steps=[
  ("dimension_reduction", PCA(n_components=10)),
  ("classifiers", RandomForestClassifier())
])

model_pipeline.fit(train_data.values, train_labels.values)
predictions = model_pipeline.predict(predict_data.values)

Pros

  • No need to handle the details of passing data between the two stages.
  • Code is easy to maintain

Cons

  • With this implementation, you can only apply one type of transformation to the given features. But this is the first step toward making your pipeline more elegant.

Idea 3. Feature Union with Pipeline

If you want to apply several different feature transformations to your dataset, you can try the FeatureUnion API. It provides a simple way to merge the arrays produced by different types of transformations. Here is a code snippet showing how to use it:

Code Example

from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import FeatureUnion

model_pipeline = Pipeline(steps=[
  ("feature_union", FeatureUnion([
    ("pca", PCA(n_components=1)),
    ("svd", TruncatedSVD(n_components=2))
  ])),
  ("classifiers", RandomForestClassifier())
])

model_pipeline.fit(train_data.values, train_labels.values)
predictions = model_pipeline.predict(predict_data.values)

Pros

  • Use different feature transformers and compose them without splitting your code into several parts.

Cons

  • Cannot apply different transformations to different features
  • Cannot pass a pandas DataFrame directly and access its columns in a dict-like way inside the pipeline

Idea 4. Idea 3 + Column Transformer

With Idea 3, you can easily implement a pipeline with different transformations, but the two problems mentioned above remain. While looking for a solution, I surveyed different materials and found the ColumnTransformer API. I really like this API because it lets you express your pipeline like a configuration and train / predict on your data with a single command.

Code Example

from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

model_pipeline = Pipeline(steps=[
  ("features", FeatureUnion([
    (
      "numerical_features",
      ColumnTransformer([
        (
          "numerical",
          Pipeline(steps=[(
            "impute_stage",
            SimpleImputer(missing_values=np.nan, strategy="median")
          )]),
          ["feature_1"]
        )
      ])
    ), (
      "categorical_features",
      ColumnTransformer([
        (
          "country_encoding",
          Pipeline(steps=[
            ("ohe", OneHotEncoder(handle_unknown="ignore")),
            ("reduction", NMF(n_components=8)),
          ]),
          ["country"],
        ),
      ])
    ), (
      "text_features",
      ColumnTransformer([
        (
          "title_vec",
          Pipeline(steps=[
            ("tfidf", TfidfVectorizer()),
            ("reduction", NMF(n_components=50)),
          ]),
          # TfidfVectorizer expects 1D input, so the column is passed as a
          # string rather than a list.
          "title"
        )
      ])
    )
  ])),
  ("classifiers", RandomForestClassifier())
])

model_pipeline.fit(train_data, train_labels.values)
predictions = model_pipeline.predict(predict_data)

Pros

All data transformations can be integrated into a single model pipeline that is easy to maintain. You can separate different types of data, such as numerical and categorical data, and process them with different methods.

Cons

I haven't found any real drawbacks to this kind of implementation for feature engineering.


More tricks in your pipeline

With the above tricks, you can create a machine learning pipeline elegantly. Here I want to introduce some more advanced tricks, covering:

  • how to do stacking ensemble learning in your pipeline
  • how to integrate keras in your pipeline

Stacking ensemble methods in a pipeline

As you know, we usually want to use stacking to avoid the bias of one specific method. If you are still new to stacking, you can read this tutorial first. When implementing stacking, the question becomes: how do we make the stacking method one step in the pipeline? I read this material, and the spirit of it is to build a customized transformer class.
The implementation here is not perfect, but it is a good starting point for us to expand on.
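Below is a minimal sketch of this idea, assuming classifiers that implement predict_proba; the class name StackingTransformer, the base models, and the meta classifier are my own illustrative choices, not taken from the referenced material:

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

class StackingTransformer(BaseEstimator, TransformerMixin):
  # Wrap several base models and expose their predicted class
  # probabilities as features for the next step in the pipeline.
  def __init__(self, base_models):
    self.base_models = base_models

  def fit(self, X, y=None):
    for model in self.base_models:
      model.fit(X, y)
    return self

  def transform(self, X):
    # Concatenate the probabilities of all base models column-wise.
    return np.hstack([model.predict_proba(X) for model in self.base_models])

stacking_pipeline = Pipeline(steps=[
  ("stacking", StackingTransformer([
    RandomForestClassifier(),
    GradientBoostingClassifier(),
  ])),
  ("meta_classifier", LogisticRegression()),
])

One caveat: fitting and transforming on the same training data leaks the base models' in-sample predictions to the meta classifier, so a natural way to expand this sketch is to return out-of-fold predictions (e.g. via cross_val_predict) during training.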

Quickly adapt a neural network model with the Keras API

The Keras Scikit-Learn API provides a simple way to integrate your neural network model with the scikit-learn API. You can quickly implement a Keras model and plug it into your custom pipeline as one step of the pipeline object.
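Here is a minimal sketch, assuming a binary classification problem and the keras.wrappers.scikit_learn module that shipped with Keras at the time of writing (recent versions provide an equivalent wrapper in the separate scikeras package); the architecture and input_dim=10 are placeholders:

from keras.models import Sequential
from keras.layers import Dense
from keras.wrappers.scikit_learn import KerasClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

def build_model():
  # input_dim must match the number of features produced by the
  # preceding pipeline steps; 10 is only a placeholder.
  model = Sequential([
    Dense(32, activation="relu", input_dim=10),
    Dense(1, activation="sigmoid"),
  ])
  model.compile(optimizer="adam", loss="binary_crossentropy")
  return model

nn_pipeline = Pipeline(steps=[
  ("scaling", StandardScaler()),
  ("nn", KerasClassifier(build_fn=build_model, epochs=10, batch_size=32, verbose=0)),
])

From here, nn_pipeline.fit and nn_pipeline.predict work just like any other scikit-learn pipeline.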

One drawback is that the steps outside the neural network cannot be optimized by the network's training; you still need to tune the feature engineering part yourself, although the pipeline handles the data preprocessing if you want to use a neural network in it. Another point is that I haven't tried building the pipeline when the neural network takes multiple inputs, so I have no idea how to integrate a multi-input neural network into a scikit-learn pipeline.


Conclusion

This is just a simple introduction to give you some thoughts on how to do feature engineering in an elegant way. I believe there are still many awesome tricks that can help us create machine learning pipelines with simple code. Having surveyed the documentation and API design of scikit-learn, I enjoy their approach to machine learning development and think it is well worth following.

Comments
Iain MacCormick
5 years ago

One more quick question, a general question regarding sklearn pipelines rather than a specific problem I've encountered. Theoretically, can you use sklearn pipelines in combination with gridsearch to optimize on a certain strategy? For example with encoding strategies, is it possible to define in the pipeline OrdinalEncoding OR TargetEncoding, and then run GridSearch to find the best encoding method for the problem? Or is it only possible to use pipelines to optimize the parameters within?

Don’t worry about giving a specific example, unless of course you have time. I just need to understand if it is plausible.

Prateek
3 years ago

You can define a step in the pipeline and use grid search to set the step to None, e.g. grid = {'encoding_step': [Encoder(*args), None]}

Pass this to gridsearch.
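A minimal sketch of that idea (the step name encoding_step and the encoder choices are illustrative; recent scikit-learn versions use the string "passthrough" rather than None to skip a step):

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

pipe = Pipeline(steps=[
  ("encoding_step", OrdinalEncoder()),
  ("classifier", RandomForestClassifier()),
])

# The whole step is treated as a hyperparameter: each candidate estimator
# replaces "encoding_step", and "passthrough" skips the step entirely.
param_grid = {
  "encoding_step": [
    OrdinalEncoder(),
    OneHotEncoder(handle_unknown="ignore"),
    "passthrough",
  ]
}
search = GridSearchCV(pipe, param_grid=param_grid, cv=3)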

Iain MacCormick
5 years ago

Hi Bruce, thanks for the article. I’m new to sklearn pipelines and this was a really good introduction and overview. I am having some trouble with encoding using the method in this post. I was wondering if you would be able to help me. I wrote a question on stack overflow:

https://datascience.stackexchange.com/questions/61323/error-encoding-categorical-features-using-sklearn-pipelines

If you have a chance to take a look, feel free to answer here or on stackoverflow!

Wade Leftwich
5 years ago

Great article, thanks for posting.

I am using a similar approach with a Keras model that does take multiple inputs, to accommodate ordinal-encoded category embeddings where one code might show up in several features. It goes something like this:

# splitx() is in a separate module, to make it available to pickled pipelines
def splitx(X, numlen):
    """ Split 2D np.array (observations x features) into a list of arrays.
    First all numeric features together, assume all numerics are in beginning cols.
    Then an array for each categorical feature, 1 col each.
    Used in Keras models with embeddings and multiple inputs.
    """
    L = [X[:, :numlen]]
    for i in range(numlen, X.shape[1]):
        L.append(X[:, i])
    return L

splitx_ft = FunctionTransformer(splitx, validate=True, kw_args={'numlen': len(numeric_atts)})

num_pipeline = Pipeline([
    ('num_selector', DataFrameSelector(numeric_atts)),
    ('std_scaler', StandardScaler())
])

cat_pipeline = Pipeline([
    ('cat_selector', DataFrameSelector(category_atts)),
    ('ordinal_encoder', make_cat_encoder(category_atts, encoding='ordinal', handle_unknown='error'))
])

fu_pipeline = FeatureUnion([
    ('num_pipeline', num_pipeline),
    ('cat_pipeline', cat_pipeline)
])

pipeline = Pipeline([
    ('fu_pipeline', fu_pipeline),
    ('splitx_ft', splitx_ft)
])

At the end of the pipeline you have a list of arrays to feed to the Keras model.

Bruce Kuo
5 years ago

Thanks Wade! Pretty interesting method, I will try it in my keras code!
