
Kaggle Titanic: Machine Learning from Disaster | Modelling Part 2

Published Nov 09, 2019

Continued from Part 1.

Previously, we did the following:

  • explored the data set
  • performed advanced feature engineering

However, to get a sneak peek at the whole article (Part 1 and 2), open up this notebook viewer, and if you want to run each notebook cell, you can also use Binder.

Or go to the Kaggle-Play repo and launch Binder to run the notebook cells.


OK, it's time to build the model for our survival prediction problem.

Predictive Modeling

Here we split the combined dataset back into its train and test parts. To guard against overfitting we could also hold out a separate validation set, but that would waste training data. Instead we use a K-fold approach: StratifiedKFold splits the training data into 10 folds.
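
For reference, here are the imports the snippets in this part rely on (a sketch, assuming scikit-learn, pandas, numpy, matplotlib, and seaborn are installed, and that the dataset, train, and test DataFrames come from the feature engineering in Part 1):

# Imports assumed by the code below (adjust to your environment)
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.model_selection import (StratifiedKFold, cross_val_score,
                                     GridSearchCV, learning_curve)
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (AdaBoostClassifier, RandomForestClassifier,
                              ExtraTreesClassifier, GradientBoostingClassifier,
                              VotingClassifier)
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neural_network import MLPClassifier
from sklearn.naive_bayes import GaussianNB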

# Separate train dataset and test dataset
train = dataset[:len(train)]
test = dataset[len(train):]
test.drop(labels=["Survived"],axis = 1,inplace = True)

## Separate train features and label 
Y_train = train["Survived"].astype(int)
X_train = train.drop(labels = ["Survived"],axis = 1)

# Cross validate model with Kfold stratified cross val
K_fold = StratifiedKFold(n_splits=10)

Classifier

I compare 10 popular classifiers and evaluate the mean accuracy of each of them with a stratified k-fold cross-validation procedure.

  • KNN
  • AdaBoost
  • Decision Tree
  • Random Forest
  • Extra Trees
  • Support Vector Machine
  • Gradient Boosting
  • Logistic regression
  • Linear Discriminant Analysis
  • Multi-layer Perceptron

Evaluation using Cross Validation

A great alternative is to use Scikit-Learn's cross-validation feature. The following performs K-fold cross-validation: it randomly splits the training set into 10 distinct subsets called folds, then trains and evaluates the models 10 times, picking a different fold for evaluation each time and training on the other 9 folds.

# Modeling step: test different algorithms
random_state = 2

models = [] # append all models or predictive models 
cv_results = [] # cross validation result
cv_means = [] # cross validation mean value
cv_std = [] # cross validation standard deviation

models.append(KNeighborsClassifier())
models.append(AdaBoostClassifier(DecisionTreeClassifier(random_state=random_state),random_state=random_state,learning_rate=0.1))
models.append(DecisionTreeClassifier(random_state=random_state))
models.append(RandomForestClassifier(random_state=random_state))
models.append(ExtraTreesClassifier(random_state=random_state))
models.append(SVC(random_state=random_state))
models.append(GradientBoostingClassifier(random_state=random_state))
models.append(LogisticRegression(random_state = random_state))
models.append(LinearDiscriminantAnalysis())
models.append(MLPClassifier(random_state=random_state))


for model in models :
    cv_results.append(cross_val_score(model, X_train, Y_train, 
                                      scoring = "accuracy", cv = K_fold, n_jobs=4))

for cv_result in cv_results:
    cv_means.append(cv_result.mean())
    cv_std.append(cv_result.std())

cv_frame = pd.DataFrame(
    {
        "CrossValMeans":cv_means,
        "CrossValErrors": cv_std,
        "Algorithms":[
                     "KNeighboors",
                     "AdaBoost", 
                     "DecisionTree",   
                     "RandomForest",
                     "ExtraTrees",
                     "SVC",
                     "GradientBoosting",                      
                     "LogisticRegression",
                     "LinearDiscriminantAnalysis",
                     "MultipleLayerPerceptron"]
    })

cv_plot = sns.barplot(x="CrossValMeans", y="Algorithms", data=cv_frame,
                      palette="husl", orient="h", xerr=cv_std)

cv_plot.set_xlabel("Mean Accuracy")
cv_plot = cv_plot.set_title("CV Scores")

[Figure: mean cross-validation accuracy (with error bars) for each algorithm]

Let's explore the following models separately:

  • GBC Classifier
  • Linear Discriminant Analysis
  • Logistic Regression
  • Random Forest Classifier
  • Gaussian Naive Bayes
  • Support Vector Machine

Let's start with Gradient Boosting Classifier.

# GBC Classifier
GBC_Model = GradientBoostingClassifier()

scores = cross_val_score(GBC_Model, X_train, Y_train, cv = K_fold,
                       n_jobs = 4, scoring = 'accuracy')

print(scores)
round(np.mean(scores)*100, 2)
# output

[0.83146067 0.82954545 0.76136364 0.89772727 0.90909091 0.875
 0.84090909 0.79545455 0.84090909 0.82954545]
84.11

Next, LDA

# Linear Discriminant Analysis 
LDA_Model= LinearDiscriminantAnalysis()

scores = cross_val_score(LDA_Model, X_train, Y_train, cv = K_fold,
                       n_jobs = 4, scoring = 'accuracy')

print(scores)
round(np.mean(scores)*100, 2)
# output
[0.84269663 0.82954545 0.76136364 0.88636364 0.81818182 0.80681818
 0.79545455 0.78409091 0.86363636 0.84090909]
82.29

Logistic Regression classifier.

# Logistic Regression
#
Log_Model = LogisticRegression(C=1)
scores = cross_val_score(Log_Model, X_train, Y_train, cv=K_fold, 
                        n_jobs=4, scoring='accuracy')

print(scores)
round(np.mean(scores)*100, 2)
# output
[0.83146067 0.81818182 0.76136364 0.875      0.81818182 0.77272727
 0.79545455 0.79545455 0.84090909 0.84090909]
81.5

Random Forest is an ensemble of decision tree classifiers. It should perform better than a single decision tree. Let's see.

# Random Forest Classifier Model
#
RFC_model = RandomForestClassifier(n_estimators=10)
scores = cross_val_score(RFC_model, X_train, Y_train, cv=K_fold, 
                        n_jobs=4, scoring='accuracy')

print(scores)
round(np.mean(scores)*100, 2)
# output
[0.79775281 0.88636364 0.73863636 0.80681818 0.86363636 0.79545455
 0.82954545 0.76136364 0.84090909 0.82954545]
81.5

Gaussian NB performs pretty well on binary classification.

# Gaussian Naive Bayes
GNB_Model = GaussianNB()

scores = cross_val_score(GNB_Model, X_train, Y_train, cv=K_fold, 
                        n_jobs=4, scoring='accuracy')

print(scores)
round(np.mean(scores)*100, 2)
# output
[0.78651685 0.81818182 0.75       0.86363636 0.77272727 0.79545455
 0.80681818 0.78409091 0.85227273 0.84090909]
80.71

Support Vector Machine (SVM) is a promising ML algorithm. It should also perform well.

# Support Vector Machine
SVM_Model = SVC()

scores = cross_val_score(SVM_Model, X_train, Y_train, cv=K_fold, 
                        n_jobs=4, scoring='accuracy')

print(scores)
round(np.mean(scores)*100, 2)
# output 
[0.69662921 0.65909091 0.64772727 0.72727273 0.76136364 0.70454545
 0.76136364 0.73863636 0.72727273 0.78409091]
72.08

Hyperparameter Tuning

I decided to choose the promising GradientBoosting, Linear Discriminant Analysis, RandomForest, Logistic Regression, and SVM models for the ensemble modeling. So now we need to fine-tune them.

One way to do that would be to fiddle with the hyperparameters manually until we find a great combination of hyperparameter values. This would be very tedious work, and we may not have time to explore many combinations. Instead, we should get Scikit-Learn's GridSearchCV to search for us. All we need to do is tell it which hyperparameters we want it to experiment with and what values to try out, and it will evaluate all the possible combinations of hyperparameter values using cross-validation.

Here we perform grid search optimization for GradientBoosting, RandomForest, Linear Discriminant Analysis, Logistic Regression and SVC classifier.

Hyper-Parameter Tuning on GBC

# Gradient Boosting tuning
GBC = GradientBoostingClassifier()
gb_param_grid = {
              'loss' : ["deviance"],
              'n_estimators' : [100,200,300],
              'learning_rate': [0.1, 0.05, 0.01, 0.001],
              'max_depth': [4, 8,16],
              'min_samples_leaf': [100,150,250],
              'max_features': [0.3, 0.1]
              }

gsGBC = GridSearchCV(GBC, param_grid = gb_param_grid, cv=K_fold, 
                     scoring="accuracy", n_jobs= 4, verbose = 1)

gsGBC.fit(X_train,Y_train)
GBC_best = gsGBC.best_estimator_

# Best score
gsGBC.best_score_

#output
Fitting 10 folds for each of 216 candidates, totalling 2160 fits

[Parallel(n_jobs=4)]: Done  42 tasks      | elapsed:    2.6s
[Parallel(n_jobs=4)]: Done 626 tasks      | elapsed:   12.9s
[Parallel(n_jobs=4)]: Done 1626 tasks      | elapsed:   30.5s
[Parallel(n_jobs=4)]: Done 2160 out of 2160 | elapsed:   41.0s finished

0.8365493757094211
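
Besides the best score, it is worth inspecting which hyperparameter combination won; GridSearchCV exposes this through best_params_ (the exact values depend on your run, so none are shown here):

# Inspect the winning hyperparameter combination found by the grid search
print(gsGBC.best_params_)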

Hyper-Parameter Tuning on RFC

# RFC parameter tuning
RFC = RandomForestClassifier()

## Search grid for optimal parameters
rf_param_grid = {"max_depth": [None],
              "min_samples_split": [2, 6, 20],
              "min_samples_leaf": [1, 4, 16],
              "n_estimators" :[100,200,300,400],
              "criterion": ["gini"]}


gsRFC = GridSearchCV(RFC, param_grid = rf_param_grid, cv=K_fold,
                     scoring="accuracy", n_jobs= 4, verbose = 1)

gsRFC.fit(X_train,Y_train)
RFC_best = gsRFC.best_estimator_

# Best score
gsRFC.best_score_

# output
Fitting 10 folds for each of 36 candidates, totalling 360 fits

[Parallel(n_jobs=4)]: Done  42 tasks      | elapsed:    5.5s
[Parallel(n_jobs=4)]: Done 192 tasks      | elapsed:   18.7s
[Parallel(n_jobs=4)]: Done 360 out of 360 | elapsed:   32.5s finished

0.8422247446083996

Hyper-Parameter Tuning on LR

# Logistic Regression parameter tuning
LRM = LogisticRegression()

## Search grid for optimal parameters
lr_param_grid = {"penalty" : ["l2"],
              "tol" : [0.0001,0.0002,0.0003],
              "max_iter": [100,200,300],
              "C" :[0.01, 0.1, 1, 10, 100],
              "intercept_scaling": [1, 2, 3, 4],
              "solver":['liblinear'],
              "verbose":[1]}


gsLRM = GridSearchCV(LRM, param_grid = lr_param_grid, cv=K_fold,
                     scoring="accuracy", n_jobs= 4, verbose = 1)

gsLRM.fit(X_train,Y_train)
LRM_best = gsLRM.best_estimator_

# Best score
gsLRM.best_score_

# output
Fitting 10 folds for each of 180 candidates, totalling 1800 fits

[Parallel(n_jobs=4)]: Done 351 tasks      | elapsed:    2.6s
[LibLinear]
[Parallel(n_jobs=4)]: Done 1800 out of 1800 | elapsed:    4.4s finished

0.8240635641316686

Hyper-Parameter Tuning on LDA

# Linear Discriminant Analysis - Parameter Tuning
LDA = LinearDiscriminantAnalysis()

## Search grid for optimal parameters
lda_param_grid = {"solver" : ["svd"],
              "tol" : [0.0001,0.0002,0.0003]}


gsLDA = GridSearchCV(LDA, param_grid = lda_param_grid, cv=K_fold,
                     scoring="accuracy", n_jobs= 4, verbose = 1)

gsLDA.fit(X_train,Y_train)
LDA_best = gsLDA.best_estimator_

# Best score
gsLDA.best_score_

# output
Fitting 10 folds for each of 3 candidates, totalling 30 fits

[Parallel(n_jobs=4)]: Done  23 out of  30 | elapsed:    1.9s remaining:    0.5s
[Parallel(n_jobs=4)]: Done  30 out of  30 | elapsed:    1.9s finished

0.8229284903518729

Hyper-Parameter Tuning on SVC

### SVC classifier
SVMC = SVC(probability=True)
svc_param_grid = {'kernel': ['rbf'], 
                  'gamma': [0.0001, 0.001, 0.01, 0.1, 1],
                  'C': [1, 10, 50, 100, 200, 300]}

gsSVMC = GridSearchCV(SVMC, param_grid = svc_param_grid, cv = K_fold,
                      scoring="accuracy", n_jobs= -1, verbose = 1)

gsSVMC.fit(X_train,Y_train)

SVMC_best = gsSVMC.best_estimator_

# Best score
gsSVMC.best_score_

# output
Fitting 10 folds for each of 30 candidates, totalling 300 fits

[Parallel(n_jobs=-1)]: Done  50 tasks      | elapsed:    3.2s
[Parallel(n_jobs=-1)]: Done 300 out of 300 | elapsed:   17.3s finished

0.8161180476730987

Plot Learning Curves

Diagnose Bias and Variance to Reduce Error

Learning curves are a good way to see the effects of overfitting and underfitting, and the effect of the training set size on accuracy. A learning curve plots the model's performance on the training set and the validation set as a function of the training set size. To generate the plots, we simply train the model several times on different-sized subsets of the training set. In a nutshell, a learning curve shows how the error changes as the training set size increases.
If a model performs well on the training data but generalizes poorly according to the cross-validation metrics, it is overfitting. If it performs poorly on both, it is underfitting.

When the model is trained on very few training instances, it is incapable of generalizing properly, which is why the validation error will be initially quite big.

Underfitting: If the model is underfitting the training data, adding more training examples will not help. We need to use a more complex model or come up with better features.

Overfitting: One way to improve an overfitting model is to feed it more training data until the validation error reaches the training error.

Bias-Variance Trade-Off

A model's generalization error can be expressed as the sum of three very different errors.

  • Bias
  • Variance
  • Irreducible Error

Bias Error in Learning Curve
This part of the generalization error is due to wrong assumptions, such as assuming that the data is linear when it is actually quadratic.

  • A high bias model is most likely to underfit the training data

Variance Error in Learning Curve
This part of the generalization error is due to the model's excessive sensitivity to small variations in the training data.

  • A high variance model is most likely to overfit the training data

Irreducible Error in Learning Curve
This is due to the noisiness of the data itself. It is not a concern here, because we have already cleaned the data set.


Increasing a model's complexity will typically increase its variance and reduce its bias. Conversely, reducing a model's complexity increases its bias and reduces its variance.
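
To see this trade-off directly, here is a quick optional sketch (not part of the original notebook) using scikit-learn's validation_curve, which varies a single complexity parameter such as the Random Forest's max_depth and reports training versus cross-validation accuracy:

# Optional sketch: train vs. CV accuracy as model complexity (max_depth) grows
from sklearn.model_selection import validation_curve

depth_range = [2, 4, 6, 8, 10, 12]
train_scores, cv_scores = validation_curve(
    RandomForestClassifier(n_estimators=100, random_state=random_state),
    X_train, Y_train,
    param_name="max_depth", param_range=depth_range,
    cv=K_fold, scoring="accuracy", n_jobs=4)

# Small depths give low scores on both sets (high bias); large depths push the
# training score far above the CV score (high variance).
print(train_scores.mean(axis=1))
print(cv_scores.mean(axis=1))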

Now we'll define a learning curve plotting function where the x and y axes are the training set size and the score (not the error), respectively. So the higher the score, the better the model performs.

# Plot learning curve
def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None,
                        n_jobs=-1, train_sizes=np.linspace(.1, 1.0, 5)):
    """
    Generate a simple plot of the test and training learning curve.

    Parameters
    ----------
    estimator : object type that implements the "fit" and "predict" methods
        An object of that type which is cloned for each validation.

    title : string
        Title for the chart.

    X : array-like, shape (n_samples, n_features)
        Training vector, where n_samples is the number of samples and
        n_features is the number of features.

    y : array-like, shape (n_samples) or (n_samples, n_features), optional
        Target relative to X for classification or regression;
        None for unsupervised learning.

    ylim : tuple, shape (ymin, ymax), optional
        Defines minimum and maximum y values plotted.

    cv : integer, cross-validation generator, optional
        If an integer is passed, it is the number of folds.
        Specific cross-validation objects can be passed; see the
        sklearn.model_selection module for the list of possible objects.

    n_jobs : integer, optional
        Number of jobs to run in parallel (default 1).
        
    x1 = np.linspace(0, 10, 8, endpoint=True) produces
        8 evenly spaced points in the range 0 to 10
    """
    
    
    plt.figure()
    plt.title(title)
    if ylim is not None:
        plt.ylim(*ylim)
        
    plt.xlabel("Training examples")
    plt.ylabel("Score")
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes)
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    plt.grid()

    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.1,
                     color="r")
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.1, color="g")
    plt.plot(train_sizes, train_scores_mean, 'o-', color="r",
             label="Training score")
    plt.plot(train_sizes, test_scores_mean, 'o-', color="g",
             label="Cross-validation score")

    plt.legend(loc="best")
    return plt
# Gradient boosting - Learning Curve 
plot_learning_curve(estimator = gsGBC.best_estimator_,title = "GBC learning curve",
                    X = X_train, y = Y_train, cv = K_fold);

[Figure: Gradient Boosting learning curve]

# Random Forest - Learning Curve
plot_learning_curve(estimator = gsRFC.best_estimator_, title = "RF learning curve",
                    X = X_train, y = Y_train, cv = K_fold);

[Figure: Random Forest learning curve]

# Logistic Regression - Learning Curve
plot_learning_curve(estimator = Log_Model, title = "Logistic Regression - Learning Curve",
                    X = X_train, y = Y_train, cv = K_fold);

[Figure: Logistic Regression learning curve]

# Linear Discriminant Analysis - Learning Curve
plot_learning_curve(estimator = gsLDA.best_estimator_ ,title = "Linear Discriminant - Learning Curve",
                    X = X_train, y = Y_train, cv = K_fold);

[Figure: Linear Discriminant Analysis learning curve]

# Support Vector Machine - Learning Curve
plot_learning_curve(estimator = gsSVMC.best_estimator_,title = "SVC learning curve",
                    X = X_train, y = Y_train, cv = K_fold);

[Figure: SVC learning curve]

SVC seems to generalize the predictions better, since its training and cross-validation curves are close together. In contrast, the Random Forest and Gradient Boosting classifiers tend to overfit the training set. One way to improve an overfitting model is to feed it more training data until the validation error reaches the training error.

Ensemble modeling

Another way to fine-tune our system is to combine the models that perform best. The group will often perform better than the best individual model, especially if the individual models make very different types of errors.

Building a model on top of many other models is called ensemble learning, and it is often a great way to push ML algorithms even further.

I use a voting classifier to combine the predictions coming from the two classifiers (the tuned Random Forest and Gradient Boosting models). I pass 'soft' to the voting parameter to take into account the probability of each vote.

# Soft-voting ensemble of the two best tuned models (about 84% CV accuracy)
VotingPredictor = VotingClassifier(estimators =
                           [('rfc', RFC_best), 
                            ('gbc', GBC_best)],
                           voting='soft', n_jobs = 4)


VotingPredictor = VotingPredictor.fit(X_train, Y_train)

scores = cross_val_score(VotingPredictor, X_train, Y_train, cv = K_fold,
                       n_jobs = 4, scoring = 'accuracy')

print(scores)
print(round(np.mean(scores)*100, 2))

# output

[0.79775281 0.84090909 0.72727273 0.90909091 0.90909091 0.85227273
 0.85227273 0.77272727 0.88636364 0.84090909]
83.89 # score increased

Submit Predictor

Predictive_Model = pd.DataFrame({
        "PassengerId": TestPassengerID,
        "Survived": VotingPredictor.predict(test)})

Predictive_Model.to_csv('titanic_model.csv', index=False)
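
As an optional sanity check (not in the original notebook), you can read the submission file back and confirm it has the two expected columns, PassengerId and Survived, with one row per test passenger (418 rows for the standard Kaggle Titanic test set). TestPassengerID here is the PassengerId column kept aside from the raw test set in Part 1.

# Quick sanity check on the generated submission file
submission = pd.read_csv('titanic_model.csv')
print(submission.shape)   # expected: (418, 2)
print(submission.head())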

With this, we reach the end of this series. You can find Part 1 here. If you would like to run each notebook cell, use Binder; it's awesome.


You can get the source code for the whole demonstration (Part 1 & 2) from the link below, and you can also follow me on GitHub for future code updates. Source Code: Titanic:ML


Say Hi On: Email | LinkedIn | Quora | GitHub | Medium | Twitter | Instagram
