
ML with Python: Part-3

Published Sep 30, 2019

In the previous post, we walked through the steps involved in creating a machine learning (ML) model. You might have noticed that in Building ML Model we considered multiple algorithms in a pipeline and then tuned hyperparameters for each model. Wouldn't it be easier if an automated tool could take over these repetitive and time-consuming tasks of machine learning pipeline design and hyperparameter optimization?

This is where AutoML comes in, taking over the machine learning model-building process: once a dataset is in a reasonably clean format, an AutoML system can design and optimize a machine learning pipeline faster than most practitioners could by hand.

Many such AutoML tools are available; the most popular of them are:

  • TPOT
  • H2O
  • Auto-sklearn
  • Azure AutoML, etc.

Below we will look at an example using TPOT; the others work on a similar idea. TPOT is built on top of scikit-learn and automates the most tedious part of machine learning by intelligently exploring thousands of possible pipelines to find the best one for your data.
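TPOT is published on PyPI, so a typical installation (assuming a standard Python environment; pip pulls in scikit-learn and the other dependencies automatically) looks like this:

pip install tpot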

[Figure: TPOT's automated pipeline workflow (tpotFlow.JPG)]

I am using the same problem statement and dataset that I used in Part-2, but for simplicity I am trimming some of the pre-processing steps, as TPOT itself applies multiple pre-processing steps (the lists are provided below). The Jupyter Notebook and the training/test files can also be downloaded from my git repository.

Alright, let's get started -

Import Libraries

import numpy as np 
import pandas as pd 
from tpot import TPOTClassifier

Data Load

test_df = pd.read_csv("test.csv")
train_df = pd.read_csv("train.csv")

Data Cleanup

# Drop identifier and sparse columns that won't be used as features
train_df = train_df.drop(['PassengerId', 'Cabin', 'Ticket', 'Name'], axis=1)
test_df = test_df.drop(['Cabin', 'Ticket', 'Name'], axis=1)
data = [train_df, test_df]

for dataset in data:
    mean = train_df["Age"].mean()
    std = train_df["Age"].std()
    is_null = dataset["Age"].isnull().sum()
    # draw random ages between (mean - std) and (mean + std) for the missing entries
    rand_age = np.random.randint(mean - std, mean + std, size=is_null)
    # fill NaN values in the Age column with the random values generated
    age_slice = dataset["Age"].copy()
    age_slice[np.isnan(age_slice)] = rand_age
    dataset["Age"] = age_slice
    dataset["Age"] = dataset["Age"].astype(int)
data = [train_df, test_df]

for dataset in data:
    dataset['Embarked'] = dataset['Embarked'].fillna('S')
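'S' is used here because it is by far the most frequent port in the training data; a quick check (run before the fill, counts shown are for the standard Kaggle training set) makes that explicit:

train_df['Embarked'].value_counts()
# S    644
# C    168
# Q     77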

Converting Features:

data = [train_df, test_df]

# Fare: fill missing values and convert to int
for dataset in data:
    dataset['Fare'] = dataset['Fare'].fillna(0)
    dataset['Fare'] = dataset['Fare'].astype(int)

# Sex: encode the categories as integers
genders = {"male": 0, "female": 1}
for dataset in data:
    dataset['Sex'] = dataset['Sex'].map(genders)

# Embarked: encode the ports as integers
ports = {"S": 0, "C": 1, "Q": 2}
for dataset in data:
    dataset['Embarked'] = dataset['Embarked'].map(ports)

# Age: bin into ordinal groups
for dataset in data:
    dataset['Age'] = dataset['Age'].astype(int)
    dataset.loc[ dataset['Age'] <= 11, 'Age'] = 0
    dataset.loc[(dataset['Age'] > 11) & (dataset['Age'] <= 18), 'Age'] = 1
    dataset.loc[(dataset['Age'] > 18) & (dataset['Age'] <= 22), 'Age'] = 2
    dataset.loc[(dataset['Age'] > 22) & (dataset['Age'] <= 27), 'Age'] = 3
    dataset.loc[(dataset['Age'] > 27) & (dataset['Age'] <= 33), 'Age'] = 4
    dataset.loc[(dataset['Age'] > 33) & (dataset['Age'] <= 40), 'Age'] = 5
    dataset.loc[ dataset['Age'] > 40, 'Age'] = 6

for dataset in data:
    dataset.loc[ dataset['Fare'] <= 7.91, 'Fare'] = 0
    dataset.loc[(dataset['Fare'] > 7.91) & (dataset['Fare'] <= 14.454), 'Fare'] = 1
    dataset.loc[(dataset['Fare'] > 14.454) & (dataset['Fare'] <= 31), 'Fare']   = 2
    dataset.loc[(dataset['Fare'] > 31) & (dataset['Fare'] <= 99), 'Fare']   = 3
    dataset.loc[(dataset['Fare'] > 99) & (dataset['Fare'] <= 250), 'Fare']   = 4
    dataset.loc[ dataset['Fare'] > 250, 'Fare'] = 5
    dataset['Fare'] = dataset['Fare'].astype(int)
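As an aside, the same binning can be written more compactly with pandas. A hedged sketch using pd.cut with the same fare boundaries, as an alternative to the loop above rather than an extra step:

fare_bins = [-1, 7.91, 14.454, 31, 99, 250, float('inf')]
for dataset in data:
    # 6 bins with the same edges as the .loc assignments above, labeled 0-5
    dataset['Fare'] = pd.cut(dataset['Fare'], bins=fare_bins,
                             labels=range(6)).astype(int)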

Building Machine Learning Models

X_train = train_df.drop("Survived", axis=1)
Y_train = train_df["Survived"]
X_test  = test_df.drop("PassengerId", axis=1).copy()
# 5 generations of evolutionary search, 20 pipelines per generation;
# verbosity=2 prints progress after each generation
tpot = TPOTClassifier(generations=5, population_size=20, verbosity=2)
tpot.fit(X_train, Y_train)

Result:

Generation 1 - Current best internal CV score: 0.8327578288714715
Generation 2 - Current best internal CV score: 0.8327578288714715
Generation 3 - Current best internal CV score: 0.8327578288714715
Generation 4 - Current best internal CV score: 0.833931853718029
Generation 5 - Current best internal CV score: 0.8395310000365276

Best pipeline: RandomForestClassifier(MultinomialNB(input_matrix, alpha=0.1, fit_prior=True), bootstrap=True, criterion=gini, max_features=0.5, min_samples_leaf=4, min_samples_split=17, n_estimators=100)


TPOTClassifier(config_dict=None, crossover_rate=0.1, cv=5,
               disable_update_check=False, early_stop=None, generations=5,
               max_eval_time_mins=5, max_time_mins=None, memory=None,
               mutation_rate=0.9, n_jobs=1, offspring_size=None,
               periodic_checkpoint_folder=None, population_size=20,
               random_state=None, scoring=None, subsample=1.0, template=None,
               use_dask=False, verbosity=2, warm_start=False)

Above you can see that TPOT has chosen a RandomForestClassifier (stacked on top of a MultinomialNB) as the best pipeline.
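TPOT can also write the winning pipeline out as standalone scikit-learn code via its export method (the file name here is arbitrary):

tpot.export('tpot_titanic_pipeline.py')

The exported script recreates the fitted pipeline, so it can be run later without TPOT itself.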

Classification algorithms and parameter ranges TPOT chooses from -

  1. 'sklearn.naive_bayes.BernoulliNB': { 'alpha': [1e-3, 1e-2, 1e-1, 1., 10., 100.], 'fit_prior': [True, False] }
  2. 'sklearn.naive_bayes.MultinomialNB': { 'alpha': [1e-3, 1e-2, 1e-1, 1., 10., 100.], 'fit_prior': [True, False] }
  3. 'sklearn.tree.DecisionTreeClassifier': { 'criterion': ['gini', 'entropy'], 'max_depth': range(1, 11), 'min_samples_split': range(2, 21), 'min_samples_leaf': range(1, 21) }
  4. 'sklearn.ensemble.ExtraTreesClassifier': { 'n_estimators': [100], 'criterion': ['gini', 'entropy'], 'max_features': np.arange(0.05, 1.01, 0.05), 'min_samples_split': range(2, 21), 'min_samples_leaf': range(1, 21), 'bootstrap': [True, False] }
  5. 'sklearn.ensemble.RandomForestClassifier': { 'n_estimators': [100], 'criterion': ['gini', 'entropy'], 'max_features': np.arange(0.05, 1.01, 0.05), 'min_samples_split': range(2, 21), 'min_samples_leaf': range(1, 21), 'bootstrap': [True, False] }
  6. 'sklearn.ensemble.GradientBoostingClassifier': { 'n_estimators': [100], 'learning_rate': [1e-3, 1e-2, 1e-1, 0.5, 1.], 'max_depth': range(1, 11), 'min_samples_split': range(2, 21), 'min_samples_leaf': range(1, 21), 'subsample': np.arange(0.05, 1.01, 0.05), 'max_features': np.arange(0.05, 1.01, 0.05) }
  7. 'sklearn.neighbors.KNeighborsClassifier': { 'n_neighbors': range(1, 101), 'weights': ['uniform', 'distance'], 'p': [1, 2] }
  8. 'sklearn.svm.LinearSVC': { 'penalty': ['l1', 'l2'], 'loss': ['hinge', 'squared_hinge'], 'dual': [True, False], 'tol': [1e-5, 1e-4, 1e-3, 1e-2, 1e-1], 'C': [1e-4, 1e-3, 1e-2, 1e-1, 0.5, 1., 5., 10., 15., 20., 25.] }
  9. 'sklearn.linear_model.LogisticRegression': { 'penalty': ['l1', 'l2'], 'C': [1e-4, 1e-3, 1e-2, 1e-1, 0.5, 1., 5., 10., 15., 20., 25.], 'dual': [True, False] }
  10. 'xgboost.XGBClassifier': { 'n_estimators': [100], 'max_depth': range(1, 11), 'learning_rate': [1e-3, 1e-2, 1e-1, 0.5, 1.], 'subsample': np.arange(0.05, 1.01, 0.05), 'min_child_weight': range(1, 21), 'nthread': [1] }

Preprocessors that could be applied by TPOT -

  1. 'sklearn.preprocessing.Binarizer': { 'threshold': np.arange(0.0, 1.01, 0.05) }
  2. 'sklearn.decomposition.FastICA': { 'tol': np.arange(0.0, 1.01, 0.05) }
  3. 'sklearn.cluster.FeatureAgglomeration': { 'linkage': ['ward', 'complete', 'average'], 'affinity': ['euclidean', 'l1', 'l2', 'manhattan', 'cosine'] }
  4. 'sklearn.preprocessing.MaxAbsScaler': { }
  5. 'sklearn.preprocessing.MinMaxScaler': { }
  6. 'sklearn.preprocessing.Normalizer': { 'norm': ['l1', 'l2', 'max'] }
  7. 'sklearn.kernel_approximation.Nystroem': {
    'kernel': ['rbf', 'cosine', 'chi2', 'laplacian', 'polynomial', 'poly', 'linear', 'additive_chi2', 'sigmoid'],
    'gamma': np.arange(0.0, 1.01, 0.05), 'n_components': range(1, 11)
    }
  8. 'sklearn.decomposition.PCA': {
    'svd_solver': ['randomized'],
    'iterated_power': range(1, 11) }
  9. 'sklearn.preprocessing.PolynomialFeatures': { 'degree': [2], 'include_bias': [False], 'interaction_only': [False] }
  10. 'sklearn.kernel_approximation.RBFSampler': { 'gamma': np.arange(0.0, 1.01, 0.05) }
  11. 'sklearn.preprocessing.RobustScaler': { }
  12. 'sklearn.preprocessing.StandardScaler': { }
  13. 'tpot.builtins.ZeroCount': { }
  14. 'tpot.builtins.OneHotEncoder': { 'minimum_fraction': [0.05, 0.1, 0.15, 0.2, 0.25], 'sparse': [False] }
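These two lists together form TPOT's default search space. If you want to narrow the search, you can pass your own dictionary through the config_dict parameter shown in the parameter dump above; a minimal sketch restricting TPOT to a small, purely illustrative subset:

# limit the search to two classifiers and one scaler (illustrative subset)
tpot_config = {
    'sklearn.ensemble.RandomForestClassifier': {
        'n_estimators': [100],
        'criterion': ['gini', 'entropy'],
        'min_samples_split': range(2, 21),
    },
    'sklearn.naive_bayes.MultinomialNB': {
        'alpha': [1e-3, 1e-2, 1e-1, 1., 10., 100.],
        'fit_prior': [True, False],
    },
    'sklearn.preprocessing.StandardScaler': { },
}
tpot = TPOTClassifier(generations=5, population_size=20,
                      verbosity=2, config_dict=tpot_config)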

Prediction

Y_prediction = tpot.predict(X_test)
submission = pd.DataFrame({
        "PassengerId": test_df["PassengerId"],
        "Survived": Y_prediction
    })
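To turn this into a Kaggle-style submission, the DataFrame just needs to be written to CSV (the file name is arbitrary):

submission.to_csv('submission.csv', index=False)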

Limitation

Running TPOT isn't as simple as fitting one model on the dataset. It considers multiple machine learning algorithms (random forests, linear models, SVMs, etc.) in a pipeline with numerous preprocessing steps (missing value imputation, scaling, PCA, feature selection, etc.), the hyperparameters for all of the models and preprocessing steps, as well as multiple ways to ensemble or stack the algorithms within the pipeline. That's why it usually takes a long time to execute and isn't feasible for large datasets.
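The search can be bounded, though. TPOTClassifier exposes max_time_mins, max_eval_time_mins, early_stop and n_jobs (all visible in the parameter dump above); a hedged sketch of a time-boxed run:

# stop the whole search after 30 minutes, cap each pipeline evaluation
# at 2 minutes, stop early after 3 generations without improvement,
# and use all CPU cores
tpot = TPOTClassifier(max_time_mins=30, max_eval_time_mins=2,
                      early_stop=3, n_jobs=-1, verbosity=2)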

Summary

All AutoML methods are developed to support data scientists, not to replace them. Such methods can free the data scientist from complicated tasks that machines solve better. But analysing the results and drawing conclusions still has to be done by data scientists who also know the application domain.

Comments
Aminzai
5 years ago

Thank you, sir! Please suggest which book I should follow for machine learning that covers the mathematical proofs.

DhananjayKumar
5 years ago

Aminzai, there's no single book that can help you master ML, as it's a complicated subject that spans many topics, purposes, and real-world applications. Though I haven't read most of them, I would suggest "Machine Learning" by Peter Flach. It's made for intermediate-to-advanced readers and goes into a greater amount of detail than other books.
