Machine Learning in Plain English: Building a Decision Tree Model to Classify Names by Gender: Part One
Machine learning is a booming field in computer science. Its focus is to train algorithms to make predictions and decisions from datasets. These datasets can either be curated or generated in real time.
In this tutorial, I'll talk about classification problems in machine learning. By the end of it, you’ll have built a machine learning model to classify names as either male or female. Sounds like fun, right?
Okay, let’s kick things off by first asking what I mean by a “classification problem.” Basically, classification problems sort things into one of a predefined set of categories:
- Fraud or Not fraud
- Female or Male
- Rise or Fall
- Sunny or Cloudy or Rainy
The typical structure of a classification problem is that you start with what you want to classify — the problem instance — and apply labels to it, with each label being a value within the predefined set of categories.
We’re going to pick the gender detection problem from the list above for our example. So, from the earlier sentence, we know that we need to have a problem instance and two labels:
Problem instance = Name
labels = (Male, Female)
In a nutshell, the flow to building this model can be visualized as:
A name => Classifier => Male/Female
Our second objective is to find rules that can classify female and male names visually (these rules don’t necessarily have to be correct).
Taking a look at the table, we can see some patterns. We can see that more female than male names end with vowels. We can also see that a male name ending with a vowel begins with a consonant, and a female name ending with a consonant begins with a vowel.
However, these constraints are not a hundred percent accurate. Rather, they’re just assumptions. We will add more constraints as we go on, but now it’s time to get into the nitty-gritty of the code.
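Before any machine learning is involved, the vowel-ending observation above can be written as a hand-coded rule. This is only a sketch of the "name => classifier => label" flow; the function name guess_gender is made up for illustration and is not part of the model we build later.

```python
VOWELS = "aeiou"

def guess_gender(name):
    """Hypothetical hand-written rule, not a trained model."""
    # Rule of thumb from the text: names ending in a vowel lean female.
    if name[-1].lower() in VOWELS:
        return "Female"
    return "Male"

for name in ["Anna", "John", "Joshua"]:
    print(name, "=>", guess_gender(name))
```

Note that "Joshua" comes out as "Female" here, which is exactly why a single rule like this is an assumption rather than a reliable classifier.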
What you'll need:
Or you can just download Anaconda, a Python data science platform, shipped with lots of data science tools you don't have to bother to install. If you have Anaconda installed, fire it up and launch Jupyter notebook, an open-source web application that allows you to create and share documents that contain live code, equations, visualizations, and explanatory text.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from sklearn.metrics import accuracy_score
The first lines are just a bunch of imports. Pandas is a data analysis tool that allows you to tabulate data in structures that have rows and columns.
Then, from the sklearn module, which contains a bunch of machine learning algorithms and data mining tools, we import train_test_split. What this basically does, or what we'll need it to do, is to split our single dataset into two: one for training the model and another for testing it. I'll explain the need for splitting the dataset in a bit.
We then go on to import DecisionTreeClassifier. This is the classification algorithm we’ll use to build our decision tree.
Finally, we import accuracy_score. This is going to be used to check the accuracy of our model.
df = pd.read_csv('./dataset/NationalNames.csv')
df = df.drop_duplicates(subset="Name")
df.head()
Here we import our dataset, NationalNames.csv, using the pandas read_csv method, which allows us to read a csv file into a pandas data frame.
This dataset contains naming trends for babies born in the United States and is sanitized, but we do have to perform one more cleaning operation, which is what you see on the second line.
There, we check for and remove duplicate names from the dataset using the pandas data frame drop_duplicates method, passing in the column we want to check.
The last line displays the first five rows of the pandas data frame.
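To see what drop_duplicates does with the subset argument, here is a small sketch on an in-memory data frame. The rows are invented for illustration and only mimic the shape of NationalNames.csv.

```python
import pandas as pd

# Hypothetical rows shaped like the dataset (not real data).
toy = pd.DataFrame({
    "Name": ["Mary", "Anna", "Mary", "John"],
    "Year": [1880, 1880, 1881, 1880],
    "Gender": ["F", "F", "F", "M"],
})

# Keep only the first row seen for each distinct name.
unique_names = toy.drop_duplicates(subset="Name")
print(unique_names)
```

The second "Mary" row is dropped even though its Year differs, because only the Name column is compared.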
Next, we'll select the information that we think is most relevant for our classification. What we want is to pick a column that best represents whether a name is male or female.
Unfortunately, with the existing dataset, there’s no definite way to determine whether a name is male or female, so we should go ahead and create a new column to better classify gender.
The first column we’re going to add will check if the name ends in a vowel. At first glance, we see that female names tend to end in vowels more often than consonants. This is not a definite classification rule, and in fact may not be a rule at all. But let’s try it and see.
First, we create a function that checks if the last letter of a name is a vowel.
# Check if the name ends in a vowel
def checkVowelEnd(name):
    if name[-1] in "aeiou":
        return "Vowel End"
    return "Consonant End"
If the last letter is a vowel, we return "Vowel End"; otherwise, we return "Consonant End". These labels are strings for now; we'll convert them to numeric form later so they can be passed into our classifier.
df["Vowel/Consonant End"] = df["Name"].apply(checkVowelEnd)
df.head()
Here, we’re creating a column Vowel/Consonant End, and we’re passing the values obtained from the vowel-ending checking function into it.
Pandas data frames have an apply method that lets us pass in a function as an argument, and that function will be called on each row of a specified data frame column. In our program, we’re applying the checkVowelEnd function to each name in the Name column.
We should see something like this:
Next, we check to see if this is truly a good indicator of whether a name is male or female. To do this, we first have to convert the Gender column by assigning numerical codes to its values. We'll use 1 for female and 0 for male, so that summing the column counts the female names.
def checkGender(gender):
    if gender == "F":
        return 1
    return 0

df["Gender Value"] = df["Gender"].apply(checkGender)
df.head()
Again, we create a new column, Gender Value, which will contain the numerical equivalent of the genders. After this action, we should have something like this:
We’ll now create a function to compare the custom column’s value with the value of the actual gender column, Gender Value.
def compare(group):
    return df.groupby([group])["Gender Value"].sum()*100/df.groupby([group])["Gender Value"].count()
Breaking down this function, we have:
df.groupby(["Vowel/Consonant End"])["Gender Value"].sum()
This will return the number of female names that end in vowels and the number of female names that end in consonants. We should get this result:
From this output, we can see that 42,424 female names end in vowels and that 17,506 female names end in consonants. To validate this result, we'll need to look at the total number of female names in our dataset. We'll see this after we explain what the count method does.
df.groupby(["Vowel/Consonant End"])['Gender Value'].count()
The count method gives us the number of all names ending in consonants and the number of all names ending in vowels. We should see the following:
From our result, we can see that, of all the names in the dataset, 50,254 end in vowels and 43,635 end in consonants. To validate this, we'll need to look at the size of our entire dataset, i.e. how many names we actually have in total.
print(len(df))  # 93889 = 43635 + 50254
We can see that the length of our dataset is the sum of 50,254 (names that end in vowels) and 43,635 (names that end in consonants).
Now, to validate that we got the correct totals, we'll add up the numbers of female names ending in vowels and in consonants. Hopefully, the answer will equal the actual count of female names in our dataset.
female_names = sum(df.groupby(["Vowel/Consonant End"])["Gender Value"].sum())
all_names = df.groupby(["Gender"])["Gender Value"].count()
print(all_names)
print("\nBoth are equal? %s" % str(female_names == all_names["F"]))
Awesome! The values are equal, so we can continue. We typically want to get the rates at which female names end in vowels and consonants, as opposed to the absolute numbers.
print(df.groupby(["Vowel/Consonant End"])["Gender Value"].sum()*100/df.groupby(["Vowel/Consonant End"])["Gender Value"].count())
We’re just converting our values to percentages here. We multiply the sum by 100 and we divide by the total count.
Using the compare function we created, we can do the following:
compare("Vowel/Consonant End")
Now, that’s more like it! We’re able to see that roughly 40% of names ending with consonants, and 84% of names ending with vowels, are female. This is a somewhat good classifier, so we'll keep the column.
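The percentage calculation can be tried on a handful of rows. This is a sketch on invented data, where Gender Value is simply a 0/1 flag; it also shows that, for a 0/1 column, dividing the sum by the count is the same as taking the group mean.

```python
import pandas as pd

# Hypothetical toy data; "Gender Value" is a 0/1 flag.
toy = pd.DataFrame({
    "Vowel/Consonant End": ["Vowel End", "Vowel End", "Vowel End",
                            "Consonant End", "Consonant End"],
    "Gender Value": [1, 1, 0, 0, 1],
})

grouped = toy.groupby("Vowel/Consonant End")["Gender Value"]

# Share of rows flagged 1 in each group, as a percentage.
by_hand = grouped.sum() * 100 / grouped.count()

# Equivalent shortcut for a 0/1 column.
by_mean = grouped.mean() * 100

print(by_hand)
```

Here two of the three "Vowel End" rows and one of the two "Consonant End" rows are flagged, so the rates are about 66.7% and 50%.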
However, if we want to get more accurate results, this column is not enough to build a decision tree. So, let’s add another. We can check whether female names frequently start with either vowels or consonants.
def vowelConsonantStart(name):
    # Names are capitalized, so lowercase the first letter before checking
    if name[0].lower() in "aeiou":
        return "Vowel Start"
    return "Consonant Start"

df["Vowel/Consonant Start"] = df["Name"].apply(vowelConsonantStart)
print("\n Comparison => %s" % compare("Vowel/Consonant Start"))
df.head()
If we look at the two rates above, we see that there’s not much difference, meaning this isn’t a very strong classification factor.
Next, let’s try using the length of names as a classifier.
def shortLongName(name):
    if len(name) < 7:
        return "Short"
    return "Long"

df["Short/Long Name"] = df["Name"].apply(shortLongName)
print(compare("Short/Long Name"))
df.head()
I think the result we get from this comparison is quite reasonable, so we'll move on to training and testing our model.
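# .copy() gives us an independent data frame, so later column assignments don't warn about modifying a slice of df
training_data = df[["Gender Value", "Vowel/Consonant End", "Short/Long Name", "Vowel/Consonant Start"]].copy()
training_data.head()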
We create a new data frame with training data and give it only the columns from the original data frame that we actually need.
From the looks of it, our columns mainly contain categorical variables, i.e. variables or attributes that can only take one of a specific set of values. Whenever we have such categorical variables, we need to convert them to numerical representations before we can feed the data into any machine learning algorithm.
So, we'll mark these columns to indicate that whenever we see particular values, they should be replaced by their numerical equivalents — usually 1 or 0.
Pandas comes through for us, as usual, and has a way to handle this.
def reprCategory(column):
    column = column.astype("category")
    return column.cat.codes

training_data[["Vowel/Consonant End", "Short/Long Name", "Vowel/Consonant Start"]] = training_data[["Vowel/Consonant End", "Short/Long Name", "Vowel/Consonant Start"]].apply(reprCategory)
training_data.head()
In the first line of the function reprCategory, we convert the column into the type category so that pandas knows that this is a categorical variable. The last line will return a column where every value has been converted to a corresponding integer code.
After creating the function, we use it to replace Vowel/Consonant End, Short/Long Name, and Vowel/Consonant Start with columns containing the appropriate numerical codes generated by applying the function reprCategory to the original columns.
When we print out the first few rows of training_data, we will see that the columns have been converted.
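To make the conversion concrete, here is a small sketch of what astype("category") and cat.codes produce for one column. The three values are made up; the point is that pandas sorts the distinct string values and numbers them from 0.

```python
import pandas as pd

column = pd.Series(["Vowel End", "Consonant End", "Vowel End"])
as_category = column.astype("category")

# Distinct values, sorted alphabetically.
categories = list(as_category.cat.categories)
# Each value replaced by the position of its category.
codes = as_category.cat.codes.tolist()

print(categories)  # ['Consonant End', 'Vowel End']
print(codes)       # [1, 0, 1]
```

So in this sketch "Consonant End" becomes 0 and "Vowel End" becomes 1, purely because of alphabetical order.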
train, test = train_test_split(training_data, test_size = 0.20)
Before continuing, it is useful to split our dataset in two — one that we can use to train our machine learning algorithm and the other to test the accuracy of our model.
The sklearn package has a module, model_selection, that can ease this process. At the beginning, we imported train_test_split from this module; it is the function we use to split our dataset. It takes in two arguments: our original data and the fraction of the original data we want to set aside as test data. In this case, we’re using 20% (0.20) of our original data.
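A quick way to check what the split does is to run it on a tiny data frame. This is a sketch on invented data; the random_state argument is an addition here (it is not used in the tutorial's code) so that the shuffle is repeatable.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical ten-row data frame.
toy = pd.DataFrame({"feature": range(10), "label": [0, 1] * 5})

# random_state is an assumption added so the split is repeatable.
train, test = train_test_split(toy, test_size=0.20, random_state=0)

print(len(train), len(test))  # 8 2
```

Every row ends up in exactly one of the two pieces, so the model is never tested on rows it was trained on.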
clf = DecisionTreeClassifier()
The sklearn package has a module called tree that contains tree-based models, from which we imported the DecisionTreeClassifier above. Here, we set up an instance of it.
clf = clf.fit(train[["Vowel/Consonant End", "Short/Long Name", "Vowel/Consonant Start"]], train["Gender Value"])
DecisionTreeClassifier has a fit method that takes in the training data with features and labels, applies a machine learning algorithm to it, and returns the same DecisionTreeClassifier object, now fitted. This object will have the decision tree that has been built embedded within it.
The fit method takes two arguments: the first is a data frame containing all the features we think are important for our classification, and the second is the column of corresponding labels.
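The features-then-labels call can be sketched on a few already-encoded rows. The column names ends_in_vowel and is_short are hypothetical stand-ins for our real columns, and random_state is added only for repeatability.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Hypothetical, already-encoded toy data.
toy = pd.DataFrame({
    "ends_in_vowel": [1, 1, 0, 0],
    "is_short": [1, 0, 1, 0],
    "label": [1, 1, 0, 0],
})

features = toy[["ends_in_vowel", "is_short"]]  # first argument: the features
labels = toy["label"]                          # second argument: the labels

clf = DecisionTreeClassifier(random_state=0)
clf.fit(features, labels)

# Predicting on the rows the tree was trained on reproduces the labels.
predictions = clf.predict(features)
print(predictions.tolist())  # [1, 1, 0, 0]
```

Getting the training labels back is expected for an unrestricted tree on such clean data; it says nothing about how the model behaves on unseen names, which is why we hold out test data.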
By printing out the classifier object, we can see a bunch of its properties. I will explain what each property does and how they can be used to increase the accuracy of our model in the next part.
The feature_importances_ attribute gives us a score between 0 and 1 for each of our features, with the most important feature scoring the highest. We can see that the Vowel/Consonant End feature is ranked here as being the most important.
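As a sketch of how these scores can be read, the snippet below fits a tree on invented data in which the label depends only on the first feature. The feature names are hypothetical.

```python
from sklearn.tree import DecisionTreeClassifier

# Hypothetical encoded rows: [ends_in_vowel, is_short].
X = [[1, 1], [1, 0], [0, 1], [0, 0]]
y = [1, 1, 0, 0]  # the label follows the first feature exactly

clf = DecisionTreeClassifier(random_state=0).fit(X, y)
importances = clf.feature_importances_

# Pair each score with its (hypothetical) feature name.
for name, score in zip(["ends_in_vowel", "is_short"], importances):
    print(name, round(score, 2))
```

The scores sum to 1, and here all of the weight lands on the first feature because the second one never helps separate the labels.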
predictions = clf.predict(test[["Vowel/Consonant End", "Short/Long Name", "Vowel/Consonant Start"]])
accuracy_score(test["Gender Value"], predictions)
Next, we attempt to gauge the accuracy of our model. We first try to use our model to predict the gender of a name using the test data. The predict method lets us pass in our data frame, and it tries to make predictions based on the data we fed into our machine learning algorithm.
Next, using the accuracy_score function, we pass in two parameters: the first is the list of actual values for each row in our test data, and the second is the list of values that our model has predicted. For a single pair of values, accuracy_score returns either 1.0 (a match) or 0.0 (a mismatch).
Since we pass in a list of data, it matches each row's value to its corresponding prediction, aggregates the results, and returns a value between 0.0 and 1.0: the fraction of predictions that were correct. If accuracy_score returns a value of exactly 0.0 or 1.0 for a list of data, then something is seriously wrong with our model. Here, we can see that my model has an accuracy score of roughly 73%. Yours may vary, but only slightly.
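The behavior of accuracy_score is easy to check by hand on a few made-up values:

```python
from sklearn.metrics import accuracy_score

# Hypothetical labels: three of the four predictions match.
actual = [0, 1, 1, 0]
predicted = [0, 1, 0, 0]

score = accuracy_score(actual, predicted)
single_hit = accuracy_score([1], [1])
single_miss = accuracy_score([1], [0])

print(score)        # 0.75
print(single_hit)   # 1.0
print(single_miss)  # 0.0
```

The score is simply the number of matching positions divided by the number of rows.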
with open("decidenames.dot", "w") as dot_file:
    export_graphviz(clf, feature_names=["Vowel/Consonant End", "Short/Long Name", "Vowel/Consonant Start"], out_file=dot_file)
It’s now time to visualize our decision tree. We open the decidenames.dot file for writing, and we write our decision tree to it using the export_graphviz function we imported at the start.
This takes in our classifier as the first parameter, feature_names (a list of all our feature names) as the second, and finally, out_file, the file we want to write to. After this, we should see the decidenames.dot file in our filesystem. It should look like the image above.
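If you want to peek at the DOT text without writing a file, export_graphviz returns it as a string when out_file is None. The sketch below uses a tiny invented dataset and hypothetical feature names rather than our real model.

```python
from sklearn.tree import DecisionTreeClassifier, export_graphviz

# Hypothetical encoded rows: [ends_in_vowel, is_short].
X = [[1, 1], [1, 0], [0, 1], [0, 0]]
y = [1, 1, 0, 0]
clf = DecisionTreeClassifier(random_state=0).fit(X, y)

# With out_file=None the DOT source is returned instead of written to disk.
dot_source = export_graphviz(
    clf,
    feature_names=["ends_in_vowel", "is_short"],
    out_file=None,
)
print(dot_source)
```

The string is the same Graphviz description that would otherwise end up in the .dot file, and it still needs a Graphviz renderer to become an image.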
From the graphic, we can see that it’s actually not a very complex decision tree. This is because we don’t have many features of similar importance.
What is basically going on here is that Vowel/Consonant End has been placed at the top, as the root node — the most important classification feature. If a name ends in a consonant, then it moves on to the next most important feature or attribute, which, as we can see, is the length of the name.
The same goes for vowels, but it’s important to note that in some situations, the next node or feature on a specific branch may not be the same as on another. For instance, it could be that, for a name ending in a vowel, the next most important factor is whether it starts with a vowel or consonant.
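The root-then-branches reading of a tree can also be printed as plain text with sklearn's export_text function, which may be handy when Graphviz is not installed. This is a sketch on invented data with hypothetical feature names, not the tree from our dataset.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical encoded rows: [ends_in_vowel, is_short].
X = [[1, 1], [1, 0], [0, 1], [0, 0]]
y = [1, 1, 0, 0]
clf = DecisionTreeClassifier(random_state=0).fit(X, y)

# Each indented line is one branch; leaves show the predicted class.
tree_text = export_text(clf, feature_names=["ends_in_vowel", "is_short"])
print(tree_text)
```

The first test listed is the root node, and each level of indentation is one step further down a branch.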
The image below labels each branch in the decision tree.
In the next part, we'll look into what each of our classifier properties does, and how they can be used to improve the accuracy of our models. We'll look into what we call “ensemble learning,” how it helps us solve a problem known as “overfitting,” and how we can use it to improve the accuracy of our model.