
Analyzing Text Classification Techniques on Youtube Data

Published Jul 23, 2019

The Process

Text Classification is a classic problem in Natural Language Processing (NLP): analyzing the contents of a piece of raw text and deciding which category it belongs to. It is similar to someone reading a Robin Sharma book and classifying it as ‘garbage’. It has broad applications such as sentiment analysis, topic labeling, spam detection, and intent detection.
Today, we shall take up a fairly simple task: classifying a video into different classes based on its title and description using different techniques (Naive Bayes, Support Vector Machines, AdaBoost, and LSTM) and analyzing their performance. These classes are chosen to be (but are not limited to):

  • Travel Blogs
  • Science and Technology
  • Food
  • Manufacturing
  • History
  • Art and Music

Without further ado, like a middle-aged dad just getting into gardening would say, ‘Let’s get our hands dirty!’.

Gathering Data

When working on a custom machine learning problem such as this, I find it very useful, if not simply satisfying, to collect my own data. For this problem, I need some metadata about videos belonging to different categories. If you are a bit of a moronic fellow, I welcome you to manually collect the data and construct the dataset. I, however, am not, so I will use the YouTube Data API v3. It was created by Google itself so that programmers like us can interact with YouTube through code. Head over to the Google Developer Console, create a sample project, and get started. The reason I chose to go with this was that I needed to collect thousands of samples, which I didn’t find possible using any other technique.
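
If you'd like a feel for what the collection code can look like, here is a minimal sketch using the google-api-python-client library. The API key, query, and helper name are placeholders, not the exact script I used:

```python
# pip install google-api-python-client
from googleapiclient.discovery import build

API_KEY = "YOUR_API_KEY"  # placeholder: create a key in the Google Developer Console
youtube = build("youtube", "v3", developerKey=API_KEY)

def fetch_videos(query, max_results=50):
    """Return (video_id, title, description) tuples for one search query."""
    response = youtube.search().list(
        q=query,
        part="snippet",
        type="video",
        maxResults=max_results,
    ).execute()
    return [
        (item["id"]["videoId"],
         item["snippet"]["title"],
         item["snippet"]["description"])
        for item in response.get("items", [])
    ]

# Example: collect candidates for one category
samples = fetch_videos("travel blog")
```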

Note: The YouTube API, like any other API offered by Google, works on a quota system. Each email account is provided with a set quota per day/month depending on the plan you take. On the free plan, which I had, I was only able to make around 2000 requests to YouTube, which posed a bit of a problem, but I overcame it using multiple email accounts.

The documentation for the API is pretty straightforward, and after using over 8 email accounts to meet the required quota, I collected the following data and stored it in a .csv file. If you wish to use this dataset for your own projects, you can download it here.

Note: You are free to explore a technique known as Web Scraping, which is used to extract data from websites. Python has a beautiful library called BeautifulSoup for this purpose. However, I found that, in the case of scraping data from YouTube search results, it only returns 25 results for one search query. This was a dealbreaker for me since I need a lot of samples in order to create an accurate model, and this was just not going to cut it.

Data Cleaning and Pre-processing

The first step of my data pre-processing is to handle the missing data. Since the missing values are supposed to be text data, there is no way to impute them, so the only option is to remove them. Fortunately, there are only 334 missing values out of 9999 total samples, so removing them will not noticeably affect model performance during training. The ‘Video Id’ column is not really useful for our predictive analysis, so it will not be part of the final training set and needs no pre-processing.
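
In pandas, this step can look something like the sketch below; the file path and column names (‘Title’, ‘Description’, ‘Video Id’) are assumptions based on the dataset described above:

```python
import pandas as pd

df = pd.read_csv("youtube_data.csv")  # placeholder path to the collected .csv

# Drop rows where the text fields are missing; text cannot be imputed.
df = df.dropna(subset=["Title", "Description"]).reset_index(drop=True)

# 'Video Id' carries no predictive signal, so exclude it from the features.
df = df.drop(columns=["Video Id"])
```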

There are 2 columns of importance here, namely Title and Description, but they are unprocessed raw text. Therefore, in order to filter out the noise, we’ll follow a very common approach for cleaning the text of these 2 columns. This approach is broken down into the following steps (a code sketch follows the list):

  1. Converting to Lowercase: This step is performed because capitalization does not change the semantic importance of a word, e.g. ‘Travel’ and ‘travel’ should be treated as the same.
  2. Removing numerical values and punctuation: Numerical values and special characters used in punctuation ($, !, etc.) do not contribute to determining the correct class.
  3. Removing extra white spaces: Each word should be separated by a single white space, else there might be problems during tokenization.
  4. Tokenizing into words: This refers to splitting a text string into a list of ‘tokens’, where each token is a word. For example, the sentence ‘I have huge biceps’ will be converted to [‘I’, ‘have’, ‘huge’, ‘biceps’].
  5. Removing non-alphabetical words and ‘Stop words’: ‘Stop words’ are words like ‘and’, ‘the’, and ‘is’, which are important when learning how to construct sentences, but of no use to us for predictive analytics.
  6. Lemmatization: Lemmatization is a pretty rad technique which converts similar words to their base form. For example, the words ‘flying’ and ‘flew’ will both be converted into their base form, ‘fly’.
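
Here is one way these six steps can be implemented with NLTK. It is a sketch rather than my exact code, and it reuses the assumed DataFrame columns from the earlier snippet:

```python
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

nltk.download("punkt")
nltk.download("stopwords")
nltk.download("wordnet")

stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def clean_text(text):
    text = text.lower()                                 # 1. lowercase
    text = re.sub(r"[^a-z\s]", " ", text)               # 2. drop numbers and punctuation
    text = re.sub(r"\s+", " ", text).strip()            # 3. collapse extra whitespace
    tokens = word_tokenize(text)                        # 4. tokenize into words
    tokens = [t for t in tokens
              if t.isalpha() and t not in stop_words]   # 5. drop non-alphabetic tokens and stop words
    tokens = [lemmatizer.lemmatize(t, pos="v")          # 6. lemmatize ('flew' -> 'fly')
              for t in tokens]
    return " ".join(tokens)

df["Title"] = df["Title"].apply(clean_text)
df["Description"] = df["Description"].apply(clean_text)
```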

“The text is clean now, hurray! Let’s pop a bottle of champagne to celebrate!” No, not yet. Even though computers today can solve the issues of the world and play hyper-realistic video games, they are still machines that do not understand our language. So we cannot feed our text data to our machine learning models as it is, no matter how clean it is. We need to convert it into numerical features so that the computer can construct a mathematical model as a solution. This constitutes the data pre-processing step.

Since the output variable (‘Category’) is also categorical in nature, we need to encode each class as a number. This is called Label Encoding.
Finally, let’s pay attention to the main piece of information for each sample: the raw text data. In order to extract features from the text and represent them in a numerical format, a very common approach is to vectorize them. The Scikit-learn library contains the TfidfVectorizer for this very purpose. TF-IDF (Term Frequency-Inverse Document Frequency) calculates the frequency of each word inside and across multiple documents in order to identify the importance of each word.
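
Both steps are available in Scikit-learn. The sketch below shows one way to wire them up, with illustrative vectorizer parameters and the Title and Description columns vectorized independently:

```python
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import TfidfVectorizer

# Encode the six category names as integers.
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(df["Category"])

# Vectorize the two text columns independently (parameters are illustrative).
tfidf_title = TfidfVectorizer(sublinear_tf=True, min_df=5, ngram_range=(1, 2))
tfidf_desc = TfidfVectorizer(sublinear_tf=True, min_df=5, ngram_range=(1, 2))

X_title = tfidf_title.fit_transform(df["Title"])
X_desc = tfidf_desc.fit_transform(df["Description"])
```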

Data Analysis and Feature Exploration

As an additional step, I decided to plot the distribution of classes to check for an imbalanced number of samples.
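
A quick way to do this, sketched with pandas and matplotlib:

```python
import matplotlib.pyplot as plt

df["Category"].value_counts().plot(kind="bar")
plt.title("Number of samples per category")
plt.xlabel("Category")
plt.ylabel("Count")
plt.tight_layout()
plt.show()
```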

Also, I wanted to check whether the features extracted using TF-IDF vectorization made any sense, so I decided to find the most correlated unigrams and bigrams for each class using both the Title and Description features.
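
The listing below was produced with an approach along these lines. The sketch scores each TF-IDF feature against a one-vs-rest class indicator using the chi-squared statistic (a common choice, though the exact scoring I used may differ in details):

```python
import numpy as np
from sklearn.feature_selection import chi2

def top_ngrams(X, vectorizer, y, class_id, n=5):
    """Return the n unigrams and n bigrams most correlated with one class."""
    scores, _ = chi2(X, y == class_id)
    order = np.argsort(scores)
    features = vectorizer.get_feature_names_out()[order]
    unigrams = [f for f in features if len(f.split()) == 1][-n:]
    bigrams = [f for f in features if len(f.split()) == 2][-n:]
    return unigrams, bigrams

for class_id, name in enumerate(label_encoder.classes_):
    uni, bi = top_ngrams(X_title, tfidf_title, y, class_id)
    print(f"# '{name}'")
    print("Most correlated unigrams:", uni)
    print("Most correlated bigrams: ", bi)
```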

# USING TITLE FEATURES
# 'art and music':
Most correlated unigrams:
------------------------------
. paint
. official
. music
. art
. theatre
Most correlated bigrams:
------------------------------
. capitol theatre
. musical theatre
. work theatre
. official music
. music video
# 'food':
Most correlated unigrams:
------------------------------
. foods
. eat
. snack
. cook
. food
Most correlated bigrams:
------------------------------
. healthy snack
. snack amp
. taste test
. kid try
. street food
# 'history':
Most correlated unigrams:
------------------------------
. discoveries
. archaeological
. archaeology
. history
. anthropology
Most correlated bigrams:
------------------------------
. history channel
. rap battle
. epic rap
. battle history
. archaeological discoveries
# 'manufacturing':
Most correlated unigrams:
------------------------------
. business
. printer
. process
. print
. manufacture
Most correlated bigrams:
------------------------------
. manufacture plant
. lean manufacture
. additive manufacture
. manufacture business
. manufacture process
# 'science and technology':
Most correlated unigrams:
------------------------------
. compute
. computers
. science
. computer
. technology
Most correlated bigrams:
------------------------------
. science amp
. amp technology
. primitive technology
. computer science
. science technology
# 'travel':
Most correlated unigrams:
------------------------------
. blogger
. vlog
. travellers
. blog
. travel
Most correlated bigrams:
------------------------------
. viewfinder travel
. travel blogger
. tip travel
. travel vlog
. travel blog
# USING DESCRIPTION FEATURES
# 'art and music':
Most correlated unigrams:
------------------------------
. official
. paint
. music
. art
. theatre
Most correlated bigrams:
------------------------------
. capitol theatre
. click listen
. production connexion
. official music
. music video
# 'food':
Most correlated unigrams:
------------------------------
. foods
. eat
. snack
. cook
. food
Most correlated bigrams:
------------------------------
. special offer
. hiho special
. come play
. sponsor series
. street food
# 'history':
Most correlated unigrams:
------------------------------
. discoveries
. archaeological
. history
. archaeology
. anthropology
Most correlated bigrams:
------------------------------
. episode epic
. epic rap
. battle history
. rap battle
. archaeological discoveries
# 'manufacturing':
Most correlated unigrams:
------------------------------
. factory
. printer
. process
. print
. manufacture
Most correlated bigrams:
------------------------------
. process make
. lean manufacture
. additive manufacture
. manufacture business
. manufacture process
# 'science and technology':
Most correlated unigrams:
------------------------------
. quantum
. computers
. science
. computer
. technology
Most correlated bigrams:
------------------------------
. quantum computers
. primitive technology
. quantum compute
. computer science
. science technology
# 'travel':
Most correlated unigrams:
------------------------------
. vlog
. travellers
. trip
. blog
. travel
Most correlated bigrams:
------------------------------
. tip travel
. start travel
. expedia viewfinder
. travel blogger
. travel blog

Modeling and Training

The four models we will be analyzing are:

  • Naive Bayes Classifier
  • Support Vector Machine
  • Adaboost Classifier
  • LSTM

The dataset is split into Train and Test sets with a split ratio of 8:2. Features for Title and Description are computed independently and then concatenated to construct a final feature matrix. This is used to train the classifiers (except the LSTM).
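
A sketch of this stage, assuming the TF-IDF matrices and encoded labels from the earlier snippets; the hyperparameters are defaults rather than tuned values:

```python
from scipy.sparse import hstack
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.ensemble import AdaBoostClassifier

# Concatenate the independently computed Title and Description features.
X = hstack([X_title, X_desc])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

models = {
    "Naive Bayes": MultinomialNB(),
    "SVM": LinearSVC(),
    "AdaBoost": AdaBoostClassifier(),  # default hyperparameters, as noted later
}

for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, "accuracy:", model.score(X_test, y_test))
```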

For the LSTM, the data pre-processing is quite different, as discussed before. Here is the process (a code sketch follows the list):

  1. Combine Title and Description for each sample into a single sentence
  2. Tokenize the combined sentence into padded sequences: Each sentence is converted into a list of tokens, each token is assigned a numerical id, and then each sequence is made the same length by padding shorter sequences and truncating longer ones.
  3. One-Hot Encoding the ‘Category’ variable
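
A sketch of this pipeline, plus a simple LSTM model, using the Keras utilities bundled with TensorFlow. The vocabulary size, sequence length, and architecture are illustrative choices, not my exact configuration, and the DataFrame and labels come from the earlier snippets:

```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout

MAX_WORDS, MAX_LEN = 20000, 100  # illustrative vocabulary size and sequence length

# 1. Combine Title and Description into a single sentence per sample.
texts = (df["Title"] + " " + df["Description"]).tolist()

# 2. Tokenize, map tokens to ids, and pad/truncate to a fixed length.
tokenizer = Tokenizer(num_words=MAX_WORDS)
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)
X_seq = pad_sequences(sequences, maxlen=MAX_LEN)

# 3. One-hot encode the integer labels.
y_onehot = to_categorical(y)

# A simple LSTM classifier.
model = Sequential([
    Embedding(MAX_WORDS, 128),
    LSTM(64),
    Dropout(0.5),
    Dense(y_onehot.shape[1], activation="softmax"),
])
model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])
history = model.fit(X_seq, y_onehot, validation_split=0.2, epochs=10, batch_size=64)
```

The `history` object returned by `fit` holds the per-epoch training and validation metrics used to plot the learning curves below.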

The learning curves for the LSTM are given below:


Analyzing Performance

Following are the Precision-Recall curves for all the different classifiers. For additional metrics, check out the complete code on GitHub.
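
If you want to reproduce the curves, one way is to binarize the labels and plot a one-vs-rest Precision-Recall curve per class, as in this sketch (shown for the SVM from the earlier snippet; the other models expose predict_proba instead of decision_function):

```python
import matplotlib.pyplot as plt
from sklearn.preprocessing import label_binarize
from sklearn.metrics import precision_recall_curve

# Binarize the test labels for a one-vs-rest view of the problem.
classes = list(range(len(label_encoder.classes_)))
y_test_bin = label_binarize(y_test, classes=classes)

# Per-class scores from the fitted SVM.
scores = models["SVM"].decision_function(X_test)

for i, name in enumerate(label_encoder.classes_):
    precision, recall, _ = precision_recall_curve(y_test_bin[:, i], scores[:, i])
    plt.plot(recall, precision, label=name)

plt.xlabel("Recall")
plt.ylabel("Precision")
plt.legend()
plt.show()
```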




The ranking of each classifier as observed in our project is as follows:

LSTM > SVM > Naive Bayes > AdaBoost

LSTMs have shown stellar performance on multiple tasks in Natural Language Processing, including this one. The presence of multiple ‘gates’ in LSTMs allows them to learn long-term dependencies in sequences. 10 points to Deep Learning!
SVMs are highly robust classifiers that try their best to find interactions between our extracted features, but the learned interactions are not on par with those of the LSTM. The Naive Bayes classifier, on the other hand, treats the features as independent, so it performs a little worse than the SVM since it does not take into account any interactions between different features.
The AdaBoost classifier is quite sensitive to the choice of hyperparameters, and since I have used the default model, it does not have the most optimal parameters, which might be the reason for its poor performance.

Final Thoughts

I hope this has been as informative for you as it has been for me. The complete code can be found on my GitHub.

Ciao
