
One-Hot Encoding in Data Science

Published Jan 11, 2022 · Last updated Jul 09, 2022

Categorical Data

Data processing is an important step in the machine learning pipeline: it transforms raw data into useful and efficient features. When working with categorical data, it is necessary to convert it into a suitable numerical form before feeding it to a machine learning model.

In statistics, a categorical variable (also called a qualitative variable) is a variable that can take on one of a limited, and usually fixed, number of possible values.

In this article, we will discover how to convert categorical data using one-hot encoding in Python with Pandas and Scikit-Learn.

One-Hot Encoding

One-hot encoding is a vector representation in which each category in the set of values is converted to a binary feature that contains 1 where the category is present in the current record and 0 otherwise.

For the sake of simplicity, I constructed a small dataset representing a list of cars.

import pandas as pd

df = pd.DataFrame({
    "name": ["Golf", "A3", "Leon", "Passat", "X6M"],
    "price": [32000, 38000, 28000, 36000, 75000],
    "brand": ["VW", "Audi", "Seat", "VW", "BMW"],
    "color": ["Black", "Blue", "Red", "Blue", "Black"]
    })

In this dataset, we have two categorical variables, brand and color; each of them has a finite set of values: {"VW", "Audi", "Seat", "BMW"} and {"Black", "Blue", "Red"} respectively.
     name  price brand  color
0    Golf  32000    VW  Black
1      A3  38000  Audi   Blue
2    Leon  28000  Seat    Red
3  Passat  36000    VW   Blue
4     X6M  75000   BMW  Black
When working with real data that contains a huge number of rows, it is helpful to check the possible values of a categorical column in a dataframe as follows:

brands = list(df["brand"].unique())
colors = list(df["color"].unique())
print("Brands labels:", brands)
print("Colors labels:", colors)
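Beyond listing the unique values, it can also be useful to see how often each category occurs, e.g. to spot rare labels before encoding them. A small sketch using the same toy dataset:

```python
import pandas as pd

df = pd.DataFrame({"brand": ["VW", "Audi", "Seat", "VW", "BMW"]})

# value_counts returns each category with its number of occurrences,
# sorted from most to least frequent
counts = df["brand"].value_counts()
print(counts)
```

Here VW appears twice and every other brand once, which matches the five cars in the dataset.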

One-Hot Encoding with Pandas

One-hot encoding can be implemented with pandas using the get_dummies function, which takes the following parameters:

  • data: array-like, Series, or DataFrame — The data containing the categorical variables for which to get dummy indicators.
  • columns: list-like (default: None) — Column names in the DataFrame to be encoded. By default (None), all columns with object or category dtype are converted.
  • prefix: str, list of str, or dict of str (default: None) — Prefix to prepend to the converted column names; it can be a single str, a list of strings with the same length as the columns list, or a dict mapping column names to prefixes.
  • drop_first: bool (default: False) — Whether to remove the first level to get k-1 dummies out of k categorical levels.
df_oh = pd.get_dummies(
    data=df,
    columns=["brand", "color"],
    prefix=["b", "c"])
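The drop_first parameter deserves a quick illustration: with k-1 dummies, the dropped category is implied when all remaining dummies are 0, which avoids redundant columns (useful, for example, with linear models sensitive to multicollinearity). A minimal sketch on the color column alone:

```python
import pandas as pd

df = pd.DataFrame({"color": ["Black", "Blue", "Red", "Blue", "Black"]})

# get_dummies sorts the categories (Black, Blue, Red); drop_first=True
# drops the first one, so a row of all zeros means "Black"
df_k1 = pd.get_dummies(df, columns=["color"], prefix="c", drop_first=True)
print(df_k1.columns.tolist())  # ['c_Blue', 'c_Red']
```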

(Output: the name and price columns followed by the binary columns b_Audi, b_BMW, b_Seat, b_VW, c_Black, c_Blue, c_Red.)

For example, the binary column b_Audi contains a single 1 because there is only one car (the A3) of the brand Audi, whereas the column b_VW contains two 1s because two cars (Golf and Passat) are of the brand VW. The same can be observed for the color columns: two Black, two Blue, and one Red.

One-Hot Encoding with scikit-learn

The scikit-learn library provides the OneHotEncoder class, a transformer that takes an array-like of integers or strings and converts it to a one-hot numeric array. By default, this transformer returns a sparse matrix; a dense array can be returned by setting the sparse_output parameter (named sparse before scikit-learn 1.2) to False.

Before passing the categorical data to the encoder, it is helpful to construct the list of new column names. Note that OneHotEncoder sorts the categories alphabetically, so the names must be built in the same order to label the output columns correctly:

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

cat_cols = ["brand", "color"]
cat_cols_encoded = []
for col in cat_cols:
    # sorted() matches the alphabetical category order used by OneHotEncoder
    cat_cols_encoded += [f"{col[0]}_{cat}" for cat in sorted(df[col].unique())]

cat_cols_encoded
['b_Audi', 'b_BMW', 'b_Seat', 'b_VW', 'c_Black', 'c_Blue', 'c_Red']

Once the list of columns names is constructed, we can fit and transform the categorical data using the One-Hot Encoder.

oh_encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')  # use sparse=False on scikit-learn < 1.2
encoded_cols = oh_encoder.fit_transform(df[cat_cols])
df_enc = pd.DataFrame(encoded_cols, columns=cat_cols_encoded)
df_oh = df.join(df_enc)
df_oh

(Output: the original columns followed by the binary columns b_Audi, b_BMW, b_Seat, b_VW, c_Black, c_Blue, c_Red, as with get_dummies.)
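The handle_unknown='ignore' argument is worth highlighting: a category never seen during fitting is encoded as a row of all zeros instead of raising an error at transform time. A minimal sketch (the "Tesla" brand is a made-up example, not part of the article's dataset):

```python
from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder(handle_unknown="ignore")
enc.fit([["VW"], ["Audi"], ["Seat"], ["BMW"]])

# "Tesla" was never seen during fit, so its one-hot row is all zeros;
# .toarray() converts the default sparse output to a dense array
row = enc.transform([["Tesla"]]).toarray()
print(row)  # [[0. 0. 0. 0.]]
```

Without handle_unknown='ignore' (the default is 'error'), the same transform would raise a ValueError, which matters when new categories can appear in production data.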

Conclusion

One-hot encoding is not the only way to handle categorical variables, but it is popular in data science among other methods such as label encoding, ordinal encoding, and dummy encoding. Each method has its own pros and cons, so I encourage you to explore the other methods in order to decide which one is most suitable for your project.
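As a brief taste of one alternative mentioned above, scikit-learn's OrdinalEncoder maps each category to a single integer instead of a binary vector, which keeps a single column but imposes an (often arbitrary) order. A small sketch on the color values:

```python
from sklearn.preprocessing import OrdinalEncoder

enc = OrdinalEncoder()
# categories are sorted alphabetically, then mapped to 0, 1, 2, ...
codes = enc.fit_transform([["Black"], ["Blue"], ["Red"], ["Blue"]])
print(codes.ravel())  # [0. 1. 2. 1.]
```

This is compact, but a model may interpret Red (2) as "greater than" Blue (1), which is meaningless for colors; that is precisely the kind of trade-off to weigh when choosing an encoding.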

Feel free to leave a comment or contact me if you have any questions / suggestions.

You can find the Jupyter-Notebook here to reproduce the results shown in this article.

