
Kaggle Titanic: Machine Learning from Disaster | Feature Eng. Part 1

Published Nov 05, 2019 · Last updated Nov 09, 2019

I barely remember when exactly I first watched the Titanic movie, but even now the Titanic remains a subject of discussion in the most diverse areas. The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew.

In this Kaggle challenge, we're asked to complete the analysis of what sorts of people were likely to survive. In particular, we're asked to apply the tools of machine learning to predict which passengers survived the tragedy.

More challenge information and the datasets are available on the Kaggle Titanic page. The data has been split into two groups:

training set (train.csv)
test set (test.csv)

TL;DR

Look at the Big Picture

The goal is to build a model that can predict the survival or the death of a given passenger based on a set of variables describing them, such as age, sex, or passenger class on the boat.

Frame the Problem

Framing the ML problem well is very important because it determines our problem space: which algorithms we will select, which performance measure we will use to evaluate our model, and also how much effort we should spend tweaking it.

The test set should be used to see how well our model performs on unseen data. For the test set, the ground truth for each passenger is not provided. It is our job to predict these outcomes. For each passenger in the test set, we use the trained model to predict whether or not they survived the sinking of the Titanic. We will use Cross-validation for evaluating estimator performance.

Basically, two datasets are available: a train set and a test set. We'll be using the training set to build our predictive model, and the test set will be used to validate it. This is a binary classification problem.

To solve this ML problem, topics like feature analysis, data visualization, missing data imputation, feature engineering, model fine-tuning and various classification models will be addressed, ending with ensemble modeling.

Preprocessing

In Data Science or ML problem spaces, Data Preprocessing matters a lot: it means making the data clean and usable before fitting a model.

Now, real-world data is messy; it typically suffers from the following:

  • inconsistent values
  • duplicate records
  • missing values
  • invalid data
  • outliers

So what? Actually this is a matter of real concern, because most models can't handle missing data. So we need to handle it ourselves. There are many approaches we can take to handle missing values in our data sets, such as the following (a small sketch follows this list):

  • Remove observations/records that have missing values. But..

    • data may be missing at random, so by doing this we may lose a lot of data
    • data may be missing non-randomly, so by doing this we may lose a lot of data and also introduce potential biases
  • Imputation

    • replace missing values with other values
    • strategies: mean, median or highest-frequency value of the given feature
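
A minimal sketch of these imputation strategies with pandas (just an illustration on the raw train set; the imputations actually applied later in this post are feature-specific):

# sketch: simple imputation strategies with pandas
import pandas as pd

df = pd.read_csv('train.csv')

df['Age'] = df['Age'].fillna(df['Age'].median())                   # median imputation
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])   # most frequent value
df['Fare'] = df['Fare'].fillna(df['Fare'].mean())                  # mean imputation (a no-op here; Fare has no NaN in train.csv)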

Table of Contents

The steps we will go through are as follows:

  • Get The Data and Explore
    Here we'll explore what's inside the dataset and, based on that, make our first observations.

  • Feature Analysis To Gain Insights
    First we try to find outliers in our datasets. There are many methods to detect outliers, but here we will use the Tukey method. Then we will analyse our features one by one.

  • Feature Engineering
    Feature engineering is the process of using domain knowledge of the data to create features that make machine learning algorithms work. Here, in our dataset, there are a few features we can engineer. I like to choose two of them.

    • Name
    • Family Size
  • Predictive Modeling (In Part 2)
    Here, we will use various classification models and compare the results. We'll use cross-validation for evaluating estimator performance, fine-tune the models, observe the learning curve of the best estimator and, finally, do ensemble modeling with the three best predictive models.

  • Submit Predictor
    Create a CSV file and submit to Kaggle.


Import

First, we will load various libraries. It may look confusing, but we will see the use cases of each of them in detail later on.

# Data Processing and Visualization Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

#  Data Modelling Libraries
from sklearn.ensemble import (RandomForestClassifier, AdaBoostClassifier,
                             GradientBoostingClassifier, ExtraTreesClassifier,
                             VotingClassifier)

from sklearn.model_selection import (GridSearchCV, cross_val_score, cross_val_predict,
                                     StratifiedKFold, learning_curve)

from sklearn.metrics import (confusion_matrix, accuracy_score) 
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

import warnings
from collections import Counter

sns.set(style = 'white' , context = 'notebook', palette = 'deep')
warnings.filterwarnings('ignore', category = DeprecationWarning)
%matplotlib inline

Get Data Sets

Using pandas, we now load the datasets. Basically two files: one for training purposes and the other for testing.

# load the datasets using pandas's read_csv method
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

# concat these two datasets, this will come handy while processing the data
dataset =  pd.concat(objs=[train, test], axis=0).reset_index(drop=True)

# separately store the IDs of the test set,
# these will be used at the end of the task for the submission file.
TestPassengerID = test['PassengerId']

Look Inside

Let's look at what we've just loaded: dataset size, shape, a short description and a few more things.

# shape of the data set
train.shape # (891, 12)

So it has 891 samples with 12 features. That's a reasonable size; let's see the top 5 samples.

# first 5 records
train.head()

  PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	0	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	0	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	0	STON/O2. 3101282	7.9250	NaN	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	0	113803	53.1000	C123	S
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	0	373450	8.0500	NaN	S

It's more convenient to run each code snippet in a Jupyter cell.

Definitions of each feature and quick thoughts:

  • PassengerId. Unique identification of the passenger. It shouldn't be necessary for the machine learning model.
  • Survived. Survival (0 = No, 1 = Yes). Binary variable that will be our target variable.
  • Pclass. Ticket class (1 = 1st, 2 = 2nd, 3 = 3rd). Ready to go.
  • Name. Name of the passenger. We need to parse before using it.
  • Sex. Gender of the passenger. Categorical variable that should be encoded. We can use dummy variables to encode it.
  • Age. Age in years.
  • SibSp. Siblings / Spouses aboard the Titanic.
  • Parch. Parents / Children aboard the Titanic.
  • Ticket. Ticket number. Big mess.
  • Fare. Passenger fare.
  • Cabin. Cabin number.
  • Embarked. Port of Embarkation (C = Cherbourg, Q = Queenstown, S = Southampton). Categorical feature that should be encoded. We can use feature mapping or make dummy variables for it.

The main conclusion is that we already have a set of features that we can easily use in our machine learning model. But features like Name, Ticket, Cabin require an additional effort before we can integrate them.

# using info method we can get quick overview of the data sets
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB

One thing to notice: we have 891 samples or entries, but columns like Age, Cabin and Embarked have some missing values. We can't ignore those. However, let's generate the descriptive statistics to get basic quantitative information about the features of our data set.

# Descriptive Statistics
train.describe()

      PassengerId	Survived	Pclass	Age	SibSp		Parch		Fare
count	891.000000	891.000000	891.000000	714.000000	891.000000	891.000000	891.000000
mean	446.000000	0.383838	2.308642	29.699118	0.523008	0.381594	32.204208
std	257.353842	0.486592	0.836071	14.526497	1.102743	0.806057	49.693429
min	1.000000	0.000000	1.000000	0.420000	0.000000	0.000000	0.000000
25%	223.500000	0.000000	2.000000	20.125000	0.000000	0.000000	7.910400
50%	446.000000	0.000000	3.000000	28.000000	0.000000	0.000000	14.454200
75%	668.500000	1.000000	3.000000	38.000000	1.000000	0.000000	31.000000
max	891.000000	1.000000	3.000000	80.000000	8.000000	6.000000	512.329200

There are three aspects that usually catch my attention when I analyse descriptive statistics:

  • Min and max values: This can give us an idea about the range of values and is helpful to detect outliers.
  • Mean and standard deviation: The mean shows us the central tendency of the distribution, while the standard deviation quantifies its amount of variation.
  • Count: Gives us a first perception of the volume of missing data.

Let's define a function to analyse the missing data in more detail.

# Create table for missing data analysis
def find_missing_data(data):
    Total = data.isnull().sum().sort_values(ascending = False)
    Percentage = (data.isnull().sum()/data.isnull().count()).sort_values(ascending = False)
    
    return pd.concat([Total,Percentage] , axis = 1 , keys = ['Total' , 'Percent'])
# run the method
find_missing_data(train)

Total	Percent
Cabin	687	0.771044
Age	177	0.198653
Embarked	2	0.002245
Fare	0	0.000000
Ticket	0	0.000000
Parch	0	0.000000
SibSp	0	0.000000
Sex	0	0.000000
Name	0	0.000000
Pclass	0	0.000000
Survived	0	0.000000
PassengerId	0	0.000000

Let's create a heatmap plot to visualize the amount of missing values.

# checking only train set - visualize
sns.heatmap(train.isnull(), cbar = False , 
            yticklabels = False , cmap = 'viridis')

fig 1.png

We can see that the Cabin feature has a terrible amount of missing values: around 77% of its data is missing. Until now we have only looked at the train set; now let's see the amount of missing values in the whole dataset.

find_missing_data(dataset)
  Total	Percent
Cabin	1014	0.774637
Survived	418	0.319328
Age	263	0.200917
Embarked	2	0.001528
Fare	1	0.000764
Ticket	0	0.000000
SibSp	0	0.000000
Sex	0	0.000000
Pclass	0	0.000000
PassengerId	0	0.000000
Parch	0	0.000000
Name	0	0.000000
# checking the whole dataset - visualize
sns.heatmap(dataset.isnull(), cbar = False , 
            yticklabels = False , cmap = 'viridis')

fig 2.png

As mentioned earlier, the ground truth (Survived) of the test set is missing, which is why it appears here as well.

Problem Spaces

So, for the train data set, we've seen its internal components and found some missing values there. We've also had a look at the observations and the attributes we care about.

Task: The goal is to predict the survival or the death of a given passenger based on a set of variables describing them, such as age, sex, or passenger class on the boat.

So, Survived is our target variable; this is the variable we're going to predict. 1 represents survived, 0 represents not survived. The rest of the attributes are called feature variables; based on those, we need to build a model which will predict whether a passenger survived or not. A minimal sketch of this split is shown below.
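
A minimal sketch of this setup (just illustrative; the real modeling pipeline comes in Part 2, after preprocessing and feature engineering):

# separate the target variable from the feature variables in the train set
y = train['Survived']                                 # target: 1 = survived, 0 = not survived
X = train.drop(columns=['Survived', 'PassengerId'])   # feature variables (still unprocessed)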

Feature Analysis

Outlier Detection

There are many methods to detect outliers; we will use the Tukey method, which flags values lying more than 1.5 × IQR below the first quartile or above the third quartile.

# Outlier detection 
def detect_outliers(df,n,features):
    """
    Takes a dataframe df of features and returns a list of the indices
    corresponding to the observations containing more than n outliers according
    to the Tukey method.
    """
    outlier_indices = []
    
    # iterate over features(columns)
    for col in features:
        
        # 1st quartile (25%)
        Q1 = np.percentile(df[col], 25)
        
        # 3rd quartile (75%)
        Q3 = np.percentile(df[col],75)
        
        # Interquartile range (IQR)
        IQR = Q3 - Q1
        
        # outlier step
        outlier_step = 1.5 * IQR
        
        # Determine a list of indices of outliers for feature col
        outlier_list_col = df[(df[col] < Q1 - outlier_step) | 
                              (df[col] > Q3 + outlier_step )].index
        # append the found outlier indices for col to the list of outlier indices 
        outlier_indices.extend(outlier_list_col)
   
        
    # select observations containing more than 2 outliers
    outlier_indices = Counter(outlier_indices)  

    multiple_outliers = list( k for k, v in outlier_indices.items() if v > n )
    return multiple_outliers   

# detect outliers from Age, SibSp , Parch and Fare
Outliers_to_drop = detect_outliers(train,2,["Age","SibSp","Parch","Fare"])
# Show the outliers rows
train.loc[Outliers_to_drop]


PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
27	28	0	1	Fortune, Mr. Charles Alexander	male	19.0	3	2	19950	263.00	C23 C25 C27	S
88	89	1	1	Fortune, Miss. Mabel Helen	female	23.0	3	2	19950	263.00	C23 C25 C27	S
159	160	0	3	Sage, Master. Thomas Henry	male	NaN	8	2	CA. 2343	69.55	NaN	S
180	181	0	3	Sage, Miss. Constance Gladys	female	NaN	8	2	CA. 2343	69.55	NaN	S
201	202	0	3	Sage, Mr. Frederick	male	NaN	8	2	CA. 2343	69.55	NaN	S
324	325	0	3	Sage, Mr. George John Jr	male	NaN	8	2	CA. 2343	69.55	NaN	S
341	342	1	1	Fortune, Miss. Alice Elizabeth	female	24.0	3	2	19950	263.00	C23 C25 C27	S
792	793	0	3	Sage, Miss. Stella Anna	female	NaN	8	2	CA. 2343	69.55	NaN	S
846	847	0	3	Sage, Mr. Douglas Bullen	male	NaN	8	2	CA. 2343	69.55	NaN	S
863	864	0	3	Sage, Miss. Dorothy Edith "Dolly"	female	NaN	8	2	CA. 2343	69.55	NaN	S
# Drop outliers
train = train.drop(Outliers_to_drop, axis = 0).reset_index(drop=True)

# after removing outlier, let's re-concat the data sets
dataset =  pd.concat(objs=[train, test], axis=0).reset_index(drop=True)

Now that we've removed the outliers, let's analyse the various features and, at the same time, handle any missing values during the analysis.

  • Numerical Analysis
  • Categorical Analysis

Numerical Analysis

First, let's analyse the correlation of the 'Survived' feature with the other numerical features: 'SibSp', 'Parch', 'Age' and 'Fare'.

# Correlation matrix between numerical values (SibSp Parch Age and Fare values) and Survived 
corr_numeric = sns.heatmap(dataset[["Survived","SibSp","Parch","Age","Fare"]].corr(),
                           annot=True, fmt = ".2f", cmap = "summer")

fig 3.png

Only the Fare feature seems to have a significant correlation with the survival probability.

But that doesn't make the other features useless. Subpopulations within these features can be correlated with survival. To estimate this, we need to explore these features in detail.

Age

Let's first look at the age distribution among survived and not-survived passengers.

# Explore the Age vs Survived features
age_survived = sns.FacetGrid(dataset, col='Survived')
age_survived = age_survived.map(sns.distplot, "Age")

fig 4.png

So, it looks like the age distributions are not the same in the survived and not-survived subpopulations. Indeed, there is a peak corresponding to young passengers who survived. We also see that passengers between 60 and 80 survived less. So, even if "Age" is not strongly correlated with "Survived", we can see that there are age categories of passengers that have more or less chance to survive.

It seems that very young passengers have more chance to survive. Let's look one more time.

fig = sns.FacetGrid(dataset, hue = 'Survived', aspect = 4)
fig.map(sns.kdeplot, 'Age' , shade = True)
fig.set(xlim = (0, dataset['Age'].max()))
fig.add_legend()

fig 5.png

Again we see that aged passengers, between 65 and 80, survived less.

Missing Age Value

We have seen a significant number of missing values in the Age column. Missing Age values are a big issue; to address this problem, I've looked at the features most correlated with Age. Let's first look for a relation between the Age and Sex features.

# visualize this using box plot
AS = sns.factorplot(y="Age", x="Sex", data = dataset, kind="box")

fig 6.png

The age distribution seems to be almost the same in the male and female subpopulations, so Sex is not informative for predicting Age. Let's explore the Age and Pclass distributions.

facet = sns.FacetGrid(dataset, hue="Pclass", aspect=4)
facet.map(sns.kdeplot,'Age',shade= True)
facet.set(xlim=(0, train['Age'].max()))
facet.add_legend()
plt.show()

fig 7.png

So, we see there are more young people from class 3. First class passengers seem older than second class, with third class following, but this alone doesn't give us a rule to predict age. So let's try another way to visualize the same variables.

# using boxplot 
PA = sns.factorplot(data = dataset , x = 'Pclass' , y = 'Age', kind = 'box')

fig 8.png

Here we can get some information: first class passengers are older than 2nd class passengers, who in turn are older than 3rd class passengers. We can easily see that the median ages of the classes are roughly 37, 29 and 24 respectively. So the strategy will be to fill missing Age values with the median age of similar rows according to Pclass.

# a custom function for age imputation
def AgeImpute(df):
    # use label-based access so the column order doesn't matter
    Age = df['Age']
    Pclass = df['Pclass']
    
    if pd.isnull(Age):
        if Pclass == 1: return 37
        elif Pclass == 2: return 29
        else: return 24
    else:
        return Age

# Age Impute
dataset['Age'] = dataset[['Age' , 'Pclass']].apply(AgeImpute, axis = 1)
# age featured imputed; no missing age records
sns.heatmap(dataset.isnull(), yticklabels = False, cbar = False, cmap = 'summer')

fig 9.png
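
As a hedged alternative to the hard-coded medians above, the same Pclass-based strategy can be expressed with a groupby, letting pandas compute the medians from the data (a sketch, not what is used in the rest of this post):

# alternative sketch: impute Age with the median Age of the corresponding Pclass,
# computed from the data instead of the hard-coded values 37 / 29 / 24
dataset['Age'] = dataset['Age'].fillna(
    dataset.groupby('Pclass')['Age'].transform('median'))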

The Fare feature is still missing some values. However, we will handle it later.

SibSp

Now, let's look at the Survived and SibSp features in detail.

# Explore SibSp feature vs Survived
# We'll use factorplot to analysis
Sib_Sur = sns.factorplot(x="SibSp",y="Survived",data=train,
                   kind="bar", size = 6 , palette = "Blues")

Sib_Sur.despine(left=True)
Sib_Sur = Sib_Sur.set_ylabels("survival probability")

fig 10.png

It seems that passengers with a lot of siblings/spouses have less chance to survive. Single passengers (0 SibSp) or those with one or two others (SibSp 1 or 2) have more chance to survive.

Parch

Let's look at the Survived and Parch features in detail.

# Explore Parch feature vs Survived
# We'll use factorplot to analysis
Sur_Par = sns.factorplot(x="Parch",y="Survived",data=train, 
                         kind="bar", size = 6 , palette = "GnBu_d")

Sur_Par.despine(left=True)
Sur_Par = Sur_Par.set_ylabels("survival probability")

fig 11.png

Small families have more chance to survive than single passengers.

Fare

Let's look at the Survived and Fare features in detail. We have seen that the Fare feature is also missing some values. Let's handle that first.

dataset["Fare"].isnull().sum() # 1

Since we have only one missing value, I like to fill it with the median value.

dataset["Fare"] = dataset["Fare"].fillna(dataset["Fare"].median())

Categorical values

We can turn categorical values into numerical values. This is needed simply because the training data we feed to the model must be numeric. We can use feature mapping or create dummy variables; a small illustration follows below.
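
A minimal sketch of both options on a toy frame (the toy values are only for illustration; the actual encoding applied to our dataset follows in the next sections):

# sketch: two ways to encode categorical columns
import pandas as pd

toy = pd.DataFrame({'Sex': ['male', 'female', 'female'],
                    'Embarked': ['S', 'C', 'Q']})

# 1) feature mapping: replace each category with a chosen integer
toy['Sex_mapped'] = toy['Sex'].map({'male': 0, 'female': 1})

# 2) dummy variables: one binary column per category (drop_first avoids redundancy)
toy = pd.concat([toy, pd.get_dummies(toy['Embarked'], prefix='Emb', drop_first=True)],
                axis=1)
print(toy)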

Sex

Let's take a quick look at the values in this feature.

print(dataset['Sex'].head()) # top 5
print(' ')
print(dataset['Sex'].tail()) # last 5
0      male
1    female
2    female
3    female
4      male
Name: Sex, dtype: object
 
1294      male
1295    female
1296      male
1297      male
1298      male
Name: Sex, dtype: object

The model cannot take such values. We need to map the Sex column to numeric values so that our model can digest them.

# encode Sex with a dummy variable; drop_first keeps a single 'male' column (1 = male, 0 = female)
sex = pd.get_dummies(dataset['Sex'], drop_first = True)
dataset = pd.concat([dataset,sex], axis = 1)

# We no longer need the original Sex feature, so we can drop it.
dataset.drop(['Sex'] , axis = 1 , inplace = True)

Let's see how many people survived based on their gender. We can guess that female passengers survived more than male ones, though this is just an assumption at this point. In the movie, we heard "Women and Children First".

# using countplot to estimate amount
sns.countplot(data = train , x = 'Survived' , hue = 'Sex', palette = 'GnBu_d')

fig 12.png

# let's see the percentage
train[["Sex","Survived"]].groupby('Sex').mean()

  Survived
Sex	
female	0.747573
male	0.190559

It is obvious that males had less chance to survive than females. This is clearly an important feature for our prediction task.

Pclass

Let's explore the passenger class feature together with the age feature. From this we can see how many children, young and aged people were in each passenger class.

facet = sns.FacetGrid(train, hue="Pclass",aspect=4)
facet.map(sns.kdeplot,'Age',shade= True)
facet.set(xlim=(0, train['Age'].max()))
facet.add_legend()
plt.show()

fig 13.png

So, we see most of the young people were in class 3, while more of the aged passengers were in first class, which suggests that they were richer.

However, let's explore Pclass vs Survived using the Sex feature. This will give more information about the survival probability of each class according to gender.

Survived_Pcalss = sns.factorplot(x="Pclass", y="Survived", 
                                 hue="Sex", data=train,size=6, 
                                 kind="bar", palette="BuGn_r")
Survived_Pcalss.despine(left=True)
Survived_Pcalss = Survived_Pcalss.set_ylabels("survival probability")

fig 14.png

Passenger survival is not the same in all classes. First class passengers have more chance to survive than second and third class passengers. And females survived more than males in every class.

Embarked

Port of Embarkation (C = Cherbourg, Q = Queenstown, S = Southampton). Categorical feature that should be encoded. We can use feature mapping or make dummy variables for it.

However, let's explore it in combination with the Pclass and Survived features, so that we can get an idea about the classes of the passengers embarking at each port.

# 'Embarked' vs 'Survived'
sns.barplot(dataset['Embarked'], dataset['Survived']);

fig 15.png

It looks like people coming from Cherbourg had more chance to survive. But why? That's odd. Let's compare this feature with other variables.

# Count
print(dataset.groupby(['Embarked'])['PassengerId'].count())

# Compare with other variables
dataset.groupby(['Embarked']).mean()
Embarked
C    270
Q    123
S    904
Name: PassengerId, dtype: int64


    Age	Fare	Parch	PassengerId	Pclass	SibSp	Survived	male
Embarked								
C	31.242296	62.336267	0.370370	690.655556	1.851852	0.400000	0.553571	0.581481
Q	25.963415	12.409012	0.113821	668.593496	2.894309	0.341463	0.389610	0.512195
S	28.973175	26.296450	0.409292	645.971239	2.347345	0.484513	0.339117	0.683628

Oh, C passengers have paid more and travel in a better class than people embarking at Q and S. The number of passengers from S is larger than the others, but the survival probability of C is higher than the others.

As we've seen earlier, the Embarked feature also has some missing values, so we can fill them with the most frequent value of Embarked, which is S (904 occurrences).

# count missing values
print(dataset["Embarked"].isnull().sum()) # 2

# Fill Embarked NaN values of the dataset with 'S', the most frequent value
dataset["Embarked"] = dataset["Embarked"].fillna("S")

# let's visualize it to confirm
sns.heatmap(dataset.isnull(), yticklabels = False, 
            cbar = False, cmap = 'summer')

fig 16.png

And there it goes. Now there are no missing values in the Embarked feature. Let's explore this feature a little bit more: we can visualize the survival probability alongside the class of passengers embarking at each port.

# Counting passenger based on Pclass and Embarked 
Embarked_Pc = sns.factorplot("Pclass", col="Embarked",  data=dataset,
                   size=5, kind="count", palette="muted", hue = 'Survived')

Embarked_Pc.despine(left=True)
Embarked_Pc = Embarked_Pc.set_ylabels("Count")

fig 17.png

Indeed, third class is the most frequent class for passengers coming from Southampton (S) and Queenstown (Q), but Cherbourg passengers are mostly in first class. From this we can also get an idea about the economic conditions of these regions at that time.

However, we need to map the Embarked column to numeric values so that our model can digest them.

# create dummy variable
embarked = pd.get_dummies(dataset['Embarked'], drop_first = True)
dataset = pd.concat([dataset,embarked], axis = 1)

# we don't need the Embarked column anymore, so we can drop it.
dataset.drop(['Embarked'] , axis = 1 , inplace = True)

Commitment for Feature Analysis

So far, we've seen the various subpopulation components of each feature and filled the gaps of missing values. We've done many visualizations of each component and tried to find some insight into them. Though we could dive deeper, I'd like to end here and focus on feature engineering.

We saw that we have several messy features like Name, Ticket and Cabin. We could do feature engineering on each of them and find some meaningful insight, but I'd like to work only on the Name variable. Ticket is, I think, not too important for the prediction task, and almost 77% of the data is missing in the Cabin variable.

However, let's have a quick look over our datasets.

dataset.head()
Age	Cabin	Fare	Name	Parch	PassengerId	Pclass	SibSp	Survived	Ticket	male	Q	S
0	22.0	NaN	7.2500	Braund, Mr. Owen Harris	0	1	3	1	0.0	A/5 21171	1	0	1
1	38.0	C85	71.2833	Cumings, Mrs. John Bradley (Florence Briggs Th...	0	2	1	1	1.0	PC 17599	0	0	0
2	26.0	NaN	7.9250	Heikkinen, Miss. Laina	0	3	3	0	1.0	STON/O2. 3101282	0	0	1
3	35.0	C123	53.1000	Futrelle, Mrs. Jacques Heath (Lily May Peel)	0	4	1	1	1.0	113803	0	0	1
4	35.0	NaN	8.0500	Allen, Mr. William Henry	0	5	3	0	0.0	373450	1	0	1

Feature Engineering

Feature engineering is an informal topic, but it is considered essential in applied machine learning. Feature engineering is the process of using domain knowledge of the data to create features that make machine learning algorithms work.

Feature engineering is the art of converting raw data into useful features. There are several feature engineering techniques that you can apply. Some techniques are -

  • Box-Cox transformations
  • Polynomials generation through non-linear expansions

But we don't want to be too serious about these right now; we'll simply apply basic feature engineering approaches to extract useful information. Still, a quick illustrative sketch of the two techniques above follows.
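
A minimal sketch, purely illustrative and not applied anywhere else in this post (it assumes the Age and Fare columns have already been imputed as done earlier):

# illustration only: Box-Cox transformation and polynomial feature expansion
from scipy import stats
from sklearn.preprocessing import PolynomialFeatures

# Box-Cox needs strictly positive values, so shift Fare (minimum fare is 0) by 1
fare_boxcox, lmbda = stats.boxcox(dataset['Fare'] + 1)

# degree-2 non-linear expansion of Age and Fare: Age, Fare, Age^2, Age*Fare, Fare^2
poly = PolynomialFeatures(degree=2, include_bias=False)
age_fare_poly = poly.fit_transform(dataset[['Age', 'Fare']])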

Name

We can assume that people's title influences how they are treated. In our case, we have several titles (like Mr, Mrs, Miss, Master etc ), but only some of them are shared by a significant number of people. Accordingly, it would be interesting if we could group some of the titles and simplify our analysis.

Let's analyse the 'Name' feature and see if we can find a sensible way to group the titles. Then we test our new groups and, if it works in an acceptable way, we keep it. For now, optimization will not be a goal. The focus is on getting something that can improve our current situation.

dataset['Name'].head(10)

0                              Braund, Mr. Owen Harris
1    Cumings, Mrs. John Bradley (Florence Briggs Th...
2                               Heikkinen, Miss. Laina
3         Futrelle, Mrs. Jacques Heath (Lily May Peel)
4                             Allen, Mr. William Henry
5                                     Moran, Mr. James
6                              McCarthy, Mr. Timothy J
7                       Palsson, Master. Gosta Leonard
8    Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)
9                  Nasser, Mrs. Nicholas (Adele Achem)
Name: Name, dtype: object
# Get Title from Name
dataset_title = [i.split(",")[1].split(".")[0].strip() for i in dataset["Name"]]

# add dataset_title to the main dataset named 'Title'
dataset["Title"] = pd.Series(dataset_title)

# count
dataset["Title"].value_counts()


Mr              753
Miss            255
Mrs             197
Master           60
Dr                8
Rev               8
Col               4
Mlle              2
Ms                2
Major             2
Lady              1
Don               1
the Countess      1
Jonkheer          1
Dona              1
Capt              1
Mme               1
Sir               1
Name: Title, dtype: int64
# Plot bar plot (titles and Age)
plt.figure(figsize=(18,5))
sns.barplot(x=dataset['Title'], y = dataset['Age'])

fig 18.png

# Means per title
print(dataset.groupby('Title')['Age'].mean())

Title
Capt            70.000000
Col             54.000000
Don             40.000000
Dona            39.000000
Dr              42.750000
Jonkheer        38.000000
Lady            48.000000
Major           48.500000
Master           7.643000
Miss            22.261137
Mlle            24.000000
Mme             24.000000
Mr              30.926295
Mrs             35.898477
Ms              26.000000
Rev             41.250000
Sir             49.000000
the Countess    33.000000
Name: Age, dtype: float64

There are 18 titles in the dataset and most of them are very uncommon, so we'd like to group them into 4 categories.

# Convert to categorical values Title 
dataset["Title"] = dataset["Title"].replace(['Lady', 'the Countess',
                                             'Capt', 'Col','Don', 'Dr', 
                                             'Major', 'Rev', 'Sir', 'Jonkheer',
                                             'Dona'], 'Rare')

dataset["Title"] = dataset["Title"].map({"Master":0, "Miss":1, "Ms" : 1 ,
                                         "Mme":1, "Mlle":1, "Mrs":1, "Mr":2, 
                                         "Rare":3})

dataset["Title"] = dataset["Title"].astype(int)

# Drop Name variable
dataset.drop(labels = ["Name"], axis = 1, inplace = True)
# viz counts the title coloumn
sns.countplot(dataset["Title"]).set_xticklabels(["Master","Miss-Mrs","Mr","Rare"]);

fig 19.png

# Let's see, based on title what's the survival probability
sns.barplot(x='Title', y='Survived', data=dataset);

fig 20.png

Catching Aspects:

  • People with the title 'Mr' survived less than people with any other title.
  • Titles with a survival rate higher than 70% are those that correspond to females (Miss-Mrs)

Our new category, 'Rare', should be more discretized. As we can see by the error bar (black line), there is a significant uncertainty around the mean value. Probably, one of the problems is that we are mixing male and female titles in the 'Rare' category. We should proceed with a more detailed analysis to sort this out. Also, the category 'Master' seems to have a similar problem. For now, we will not make any changes, but we will keep these two situations in our mind for future improvement of our data set.

From now on, there is no Name feature; the Title feature represents it instead.

# viz top 5
dataset.head()

Age	Cabin	Fare	Parch	PassengerId	Pclass	SibSp	Survived	Ticket	male	Q	S	Title
0	22.0	NaN	7.2500	0	1	3	1	0.0	A/5 21171	1	0	1	2
1	38.0	C85	71.2833	0	2	1	1	1.0	PC 17599	0	0	0	1
2	26.0	NaN	7.9250	0	3	3	0	1.0	STON/O2. 3101282	0	0	1	1
3	35.0	C123	53.1000	0	4	1	1	1.0	113803	0	0	1	1
4	35.0	NaN	8.0500	0	5	3	0	0.0	373450	1	0	1	2

Family size

I like to create a Famize (family size) feature, which is the sum of SibSp and Parch plus one (the passenger themselves).

# Create a family size descriptor from SibSp and Parch
dataset["Famize"] = dataset["SibSp"] + dataset["Parch"] + 1

# Drop SibSp and Parch variables
dataset.drop(labels = ["SibSp",'Parch'], axis = 1, inplace = True)

# Viz the survival probabily of Famize feature

facet = sns.FacetGrid(dataset, hue="Survived",aspect=4)
facet.map(sns.kdeplot,'Famize',shade= True)
facet.set(xlim=(0, dataset['Famize'].max()))
facet.add_legend()
plt.xlim(0);

fig 21.png

Survival probability is worst for large families.

Cabin & Ticket

Now, the Cabin feature has a huge amount of missing data, so I'd like to drop it anyway. Moreover, we also can't get much information from the Ticket feature for the prediction task.

# drop some useless features
dataset.drop(labels = ["Ticket",'Cabin','PassengerId'], axis = 1, 
             inplace = True)

Predictive Modeling

Continue with Part 2.

Next, we'll be building predictive models. We'll use cross-validation on some promising machine learning models, then do hyper-parameter tuning on a selection of them and end up ensembling the most prevalent ML algorithms.

However, you can get the source code of today’s demonstration from the link below and can also follow me on GitHub for future code updates. Source Code : Titanic:ML


Say Hi On: Email | LinkedIn | Quora | GitHub | Medium | Twitter | Instagram
