Multiclass Classification using Random Forest on Scikit-Learn Library
I still remember my first time reading machine learning code by an expert and feeling like a helpless victim. When I opened it up, I was hit with huge chunks of code without any comments, killing any sense of enthusiasm I may have felt.
Well, there's good news: creating a Machine Learning model in Python doesn't have to be that daunting. With the right skills and tools at your disposal, you could easily create a fully working model with high accuracy — all without huge budgets or hiring contractors.
What is in the post?
We are going to predict the species of the Iris flower using a Random Forest Classifier. The dependent variable (species) contains three possible values: Setosa, Versicolor, and Virginica. This is a classic multi-class classification problem, as the number of species to be predicted is more than two. We will use the built-in Random Forest Classifier in the scikit-learn library to predict the species.
Why a multi-class classification problem using scikit-learn?
Many real-world machine learning applications are based on multi-class classification algorithms (e.g., Object Detection, Natural Language Processing, Product Recommendations).
For beginners to machine learning and/or coding, the scikit-learn library provides easy-to-use functions for the complex tasks involved in machine learning, such as cost function calculation, gradient descent, and feature importance calculation. This helps users grasp machine learning applications without going very deeply into the math and calculations involved.
Although an understanding of the underlying mathematics is important for understanding machine learning algorithms, with the help of available libraries, it is not necessary for implementation.
How are we solving the issue?
A good multi-class classification machine learning workflow involves the following steps:
- Importing libraries
- Fetching the dataset
- Creating the dependent variable class
- Extracting features and output
- Train-Test dataset splitting (may also include validation dataset)
- Feature scaling
- Training the model
- Calculating the model score using the metric deemed fit based on the problem
- Saving the model for future use
1/9. Importing Libraries
We are going to import three libraries for our code:
- Pandas: One of the most popular libraries for data manipulation and storage. This is used to read/write the dataset and store it in a dataframe object. The library also provides various methods for dataframe transformation.
- Numpy: The library used for scientific computing. Here we are using the function vectorize for reversing the factorization of our classes to text.
- Sklearn: This library is used for a wide variety of tasks, e.g. splitting the dataset into test and train sets, training the random forest, and creating the confusion matrix.
#Importing Libraries
import numpy as np
import pandas as pd
#joblib is installed alongside scikit-learn; older scikit-learn versions exposed it as sklearn.externals.joblib
import joblib
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
print('Libraries Imported')
2/9. Fetching dataset
I used the dataset of iris from here for classification. I renamed the dataset from 'iris.data' to 'iris.data.csv' and stored it in the same folder as the Python script.
The code below will perform the following functionality:
- Store the data without colnames in dataframe named 'dataset'.
- Rename the columns to ['sepal length in cm', 'sepal width in cm','petal length in cm','petal width in cm','species'].
- Show the first five records of the dataset.
#Creating Dataset and including the first row by setting no header as input
dataset = pd.read_csv('iris.data.csv', header = None)
#Renaming the columns
dataset.columns = ['sepal length in cm', 'sepal width in cm','petal length in cm','petal width in cm','species']
print('Shape of the dataset: ' + str(dataset.shape))
print(dataset.head())
Shape of the dataset: (150, 5)
| sepal length in cm | sepal width in cm | petal length in cm | petal width in cm | species |
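As a minimal sketch of the loading step, the snippet below reads a header-less CSV from an in-memory string instead of a file. The two sample rows and the names `sample_csv`, `column_names`, and `sample` are illustrative assumptions, and passing `names=` to `read_csv` is an alternative to renaming the columns afterwards, not what the code above does.

```python
import io

import pandas as pd

# Two illustrative rows in the same comma-separated, header-less layout as iris.data.
sample_csv = io.StringIO(
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "4.9,3.0,1.4,0.2,Iris-setosa\n"
)

column_names = [
    "sepal length in cm",
    "sepal width in cm",
    "petal length in cm",
    "petal width in cm",
    "species",
]

# header=None tells pandas that the first line is data, not column names;
# names= assigns the column labels in the same call.
sample = pd.read_csv(sample_csv, header=None, names=column_names)

print(sample.shape)
print(sample.dtypes)
```

Without `header=None`, pandas would treat the first data row as the header and silently drop one observation.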
3/9. Creating the dependent variable class
We are converting the species column values from ['Iris-setosa','Iris-versicolor','Iris-virginica'] to [0,1,2]. scikit-learn classifiers can also be trained on text labels directly, but integer codes are compact and are what the rest of this walkthrough works with.
Also, we need to store the factor conversions to remember what number is substituting the text.
The code below will perform the following:
- Use pandas factorize function to factorize the species column in the dataset. This will create both factors and the definitions for the factors.
- Store the factorized column as species.
- Store the definitions for the factors.
- Show the first five rows for the species column and the definitions array.
#Creating the dependent variable class
#pd.factorize returns a tuple: (integer codes, unique labels)
factor = pd.factorize(dataset['species'])
dataset.species = factor[0]
definitions = factor[1]
print(dataset.species.head())
print(definitions)
Name: species, dtype: int64
Index(['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'], dtype='object')
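To make the two return values of factorize concrete, here is a small sketch on a hypothetical four-element series (the names `labels`, `codes`, `uniques`, and `recovered` are made up for illustration). It also shows that the original text can be recovered from the codes, which is what the reverse mapping in the evaluation step relies on.

```python
import pandas as pd

# Hypothetical toy labels, not the full iris column.
labels = pd.Series(
    ["Iris-setosa", "Iris-versicolor", "Iris-virginica", "Iris-setosa"]
)

# factorize returns a tuple: integer codes, and the unique labels
# in order of first appearance.
codes, uniques = pd.factorize(labels)
print(codes)
print(uniques)

# Looking up each code in the unique labels recovers the original text.
recovered = uniques.take(codes)
print(list(recovered))
```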
4/9. Extracting Features and Output
We need to split the dataset into independent and dependent variables. In our dataset, the first four columns are independent variables, whereas the last column, 'species', is the dependent variable.
Also, we need to convert these values from a dataframe to array for future use.
#Splitting the data into independent and dependent variables
X = dataset.iloc[:,0:4].values
y = dataset.iloc[:,4].values
print('The independent features set: ')
print(X[:5,:])
print('The dependent variable: ')
print(y[:5])
The independent features set:
[[5.1 3.5 1.4 0.2]
[4.9 3. 1.4 0.2]
[4.7 3.2 1.3 0.2]
[4.6 3.1 1.5 0.2]
[5. 3.6 1.4 0.2]]
The dependent variable:
[0 0 0 0 0]
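The position-based slicing can be checked on a tiny frame. This is a sketch on a hypothetical three-row dataframe (`toy`, `X_toy`, and `y_toy` are illustrative names); it uses `to_numpy()`, which is the currently recommended equivalent of `.values`.

```python
import pandas as pd

# Hypothetical three-row frame with the same column layout as the dataset.
toy = pd.DataFrame(
    {
        "sepal length in cm": [5.1, 7.0, 6.3],
        "sepal width in cm": [3.5, 3.2, 3.3],
        "petal length in cm": [1.4, 4.7, 6.0],
        "petal width in cm": [0.2, 1.4, 2.5],
        "species": [0, 1, 2],
    }
)

# iloc selects by position: all rows, columns 0 to 3 for the features,
# and column 4 for the target.
X_toy = toy.iloc[:, 0:4].to_numpy()
y_toy = toy.iloc[:, 4].to_numpy()

print(X_toy.shape)
print(y_toy.shape)
```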
5/9. Train-Test Data Splitting
We are going to use 75% of the data for training and the remaining 25% as test data (with 150 rows, train_test_split rounds this to 112 rows for training and 38 rows for testing). We are not going to create a separate validation set, since those are mainly needed for hyperparameter tuning, which we skip here.
The test share is relatively large because the dataset has so few rows; a smaller share would leave too few test samples for a meaningful evaluation. An 80/20 train-test split is a common rule of thumb when more data is available.
The code below uses the prebuilt function 'train_test_split' from the sklearn library to create the train and test arrays for both the independent and dependent variables. Setting random_state = 21 fixes the seed of the random shuffle so that the split is reproducible.
# Creating the Training and Test set from data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 21)
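The split sizes can be verified on stand-in data with the same shape as the iris features. This is a sketch under assumptions: `X_demo` and `y_demo` are synthetic arrays, and `stratify=y_demo` is an optional addition that is not part of the code above; it keeps the class proportions roughly equal in both parts, which can matter on small datasets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data: 150 rows, 4 columns, three balanced classes of 50 rows each.
X_demo = np.arange(600).reshape(150, 4)
y_demo = np.repeat([0, 1, 2], 50)

# Same test_size and random_state as above, plus optional stratification.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=21, stratify=y_demo
)

print(X_tr.shape, X_te.shape)
# Number of test rows per class.
print(np.bincount(y_te))
```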
6/9. Feature Scaling
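Feature scaling is an important step for many machine learning algorithms, because it can help them converge to a good solution faster. Tree-based models such as random forests are largely insensitive to the scale of the features, but we keep the step here because it belongs in a general workflow.
We will use the standard scaler provided in the sklearn library. For each feature, it subtracts the mean and divides by the standard deviation, so that the scaled feature has zero mean and unit variance.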
We will perform the following steps:
- Define a scaler by calling the function from sklearn library.
- Fit the scaler on the train feature dataset (X_train) and transform it.
- Use the scaler to transform test feature dataset (X_test).
# Feature Scaling
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
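The arithmetic behind standard scaling can be written out by hand. The sketch below uses NumPy on synthetic data (`train` and `test` are made-up arrays) and computes the same per-column z-score that StandardScaler applies with its default settings. Note that the test data is scaled with the mean and standard deviation of the training data, which is why the code above calls `fit_transform` on X_train but only `transform` on X_test.

```python
import numpy as np

# Synthetic stand-in data; the values are arbitrary.
rng = np.random.default_rng(0)
train = rng.normal(loc=5.0, scale=2.0, size=(100, 4))
test = rng.normal(loc=5.0, scale=2.0, size=(20, 4))

# Statistics are computed per column, on the training data only.
mean = train.mean(axis=0)
std = train.std(axis=0)  # population standard deviation (ddof=0)

# z-score: subtract the mean, divide by the standard deviation.
train_scaled = (train - mean) / std
# The test data reuses the training statistics.
test_scaled = (test - mean) / std

print(train_scaled.mean(axis=0).round(6))
print(train_scaled.std(axis=0).round(6))
```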
7/9. Training the model
We define the parameters for the random forest training as follows:
- n_estimators: This is the number of trees in the random forest. We have defined 10 trees in our random forest.
- criterion: This is the function used to measure the quality of a split. The options in sklearn include gini and entropy. We have used entropy.
- random_state: This is the seed for the random number generator that controls the bootstrap sampling of rows and the random selection of features, so that results are reproducible.
Next, we use the training dataset (both dependent and independent variables) to train the random forest.
# Fitting Random Forest Classification to the Training set
classifier = RandomForestClassifier(n_estimators = 10, criterion = 'entropy', random_state = 42)
classifier.fit(X_train, y_train)
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='entropy',
max_depth=None, max_features='auto', max_leaf_nodes=None,
min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
oob_score=False, random_state=42, verbose=0, warm_start=False)
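To give a feel for what the criterion parameter measures, here is a small sketch of the two impurity measures computed from the class counts at a single tree node. The helper names `gini` and `entropy` are illustrative and are not scikit-learn functions; the library computes these quantities internally when it evaluates candidate splits.

```python
import math


def gini(counts):
    """Gini impurity of a node, given the number of samples per class."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


def entropy(counts):
    """Shannon entropy (in bits) of a node, given the number of samples per class."""
    total = sum(counts)
    # Classes with zero samples contribute nothing and are skipped.
    return sum(-(c / total) * math.log2(c / total) for c in counts if c > 0)


# A node with three equally represented classes is maximally impure.
print(gini([50, 50, 50]), entropy([50, 50, 50]))

# A node containing a single class is pure under both measures.
print(gini([50, 0, 0]), entropy([50, 0, 0]))
```

Both measures are zero for a pure node and largest when the classes are evenly mixed, so a split that lowers them is separating the classes better.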
8/9. Evaluating the performance
Performance evaluation of the trained model consists of following steps:
- Predicting the species class of the test data using test feature set (X_test). We will use the predict function of the random forest classifier to predict classes.
- Converting the numeric classes of the predicted values and the actual test values into their textual equivalents. This involves the following steps:
  - Creating a dictionary that maps each class number to its text label, using the dict function along with zip.
  - Transforming the actual and predicted test data from numeric classes to textual classes.
- Evaluating the performance of the classifier using Confusion Matrix.
# Predicting the Test set results
y_pred = classifier.predict(X_test)
#Reverse factorize (converting y_pred from 0s, 1s and 2s to Iris-setosa, Iris-versicolor and Iris-virginica)
reversefactor = dict(zip(range(3),definitions))
y_test = np.vectorize(reversefactor.get)(y_test)
y_pred = np.vectorize(reversefactor.get)(y_pred)
# Making the Confusion Matrix
print(pd.crosstab(y_test, y_pred, rownames=['Actual Species'], colnames=['Predicted Species']))
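As an alternative view of the same evaluation, the sketch below uses the `confusion_matrix` and `accuracy_score` functions from sklearn.metrics on six hypothetical labels. The lists `y_true_demo` and `y_pred_demo` are made up for illustration and are not results from the model above.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

class_names = ["Iris-setosa", "Iris-versicolor", "Iris-virginica"]

# Hypothetical labels: one versicolor sample is mistaken for virginica.
y_true_demo = [
    "Iris-setosa", "Iris-setosa",
    "Iris-versicolor", "Iris-versicolor",
    "Iris-virginica", "Iris-virginica",
]
y_pred_demo = [
    "Iris-setosa", "Iris-setosa",
    "Iris-versicolor", "Iris-virginica",
    "Iris-virginica", "Iris-virginica",
]

# Rows are actual classes, columns are predicted classes,
# both in the order given by labels=.
cm = confusion_matrix(y_true_demo, y_pred_demo, labels=class_names)
acc = accuracy_score(y_true_demo, y_pred_demo)

print(cm)
print(acc)
```

The diagonal of the matrix counts correct predictions, so accuracy is the sum of the diagonal divided by the total number of samples.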
9/9. Storing the trained model
We are going to observe the importance of each of the features and then store the Random Forest classifier using the dump function of the joblib library.
# Feature importances paired with their column names
print(list(zip(dataset.columns[0:4], classifier.feature_importances_)))
# Saving the trained classifier to disk
joblib.dump(classifier, 'randomforestmodel.pkl')
[('sepal length in cm', 0.13838770253303928), ('sepal width in cm', 0.006840004111259038), ('petal length in cm', 0.43430955033126234), ('petal width in cm', 0.4204627430244394)]
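A stored model is only useful if it can be loaded back and given inputs prepared the same way as the training data. The sketch below is one possible approach, not the method used above: it bundles the scaler and the classifier into a single scikit-learn pipeline so that both are saved together, then reloads the file with `joblib.load`. It assumes the copy of the iris data bundled with scikit-learn (`load_iris`), trains on all rows for brevity, and writes to a temporary directory; the file name `iris_pipeline.pkl` is hypothetical.

```python
import os
import tempfile

import joblib
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Iris data as shipped with scikit-learn (an assumption; the article reads a CSV).
X_all, y_all = load_iris(return_X_y=True)

# Scaler and classifier in one object, so new data is scaled consistently.
model = make_pipeline(
    StandardScaler(),
    RandomForestClassifier(n_estimators=10, criterion="entropy", random_state=42),
)
model.fit(X_all, y_all)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "iris_pipeline.pkl")
    joblib.dump(model, path)       # save to disk
    restored = joblib.load(path)   # load it back

# The restored pipeline should reproduce the original predictions.
same_predictions = bool((restored.predict(X_all) == model.predict(X_all)).all())
print(same_predictions)
```

If only the classifier is saved, as in the code above, the fitted scaler has to be saved separately and applied to any new data before calling predict.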