Convolutional Neural Networks: The Biologically-Inspired Model
Can you recognize the people in the picture above? If you’re a fantasy fan, you’ll instantly recognize that they are Harry, Ron, and Hermione from J.K. Rowling’s worldwide phenomenon Harry Potter book series. The picture is a scene from Part 1 of Harry Potter and the Deathly Hallows, in which Harry, Ron, and Hermione are interrogating the thief, Mundungus Fletcher, captured moments earlier by the elves, Dobby and Kreacher. I know this because I’ve watched the movie and read the book so many times! But what would it take for a computer to understand this image? Let’s think explicitly of all of the pieces of knowledge that have to fall in place for it to make sense:
- You recognize that it is an image of a bunch of people and understand that they are in a room.
- You recognize that there are stacks of newspapers and a long wooden table with chairs, so the location is most likely a kitchen or a living room.
- You recognize Harry Potter from the few pixels that make up the glasses on his face. It helps that he has black hair as well.
- Similarly, you recognize Ron Weasley because of his red hair and Hermione Granger because of her long hair.
- You recognize the other man who is bald and wears old-fashioned clothing, which suggests he is much older.
- You recognize that the other two creatures (Dobby and Kreacher) are not human, even if you’re unfamiliar with them. You’ve used their height, facial structure, and body proportions, in addition to your knowledge of what people normally look like, to figure it out.
- Harry, Ron, and Hermione are interrogating the bald man. You derive this because you know their body posture is leaning towards him. You sense their doubtful expressions and see Hermione holding a wand in her hand (a wand is the magic weapon in the wizarding world).
- You understand that the bald man is feeling scared. You understand that he is hiding something, given that his hands are covering his chest. You start to reason about the implications of the events that are about to unfold seconds after this scene, and become curious about what secrets will be revealed.
- The two creatures are looking towards Harry and Ron. It looks like they are trying to say something. In other words, you are reasoning about those creatures’ states of mind. Whoa, you can be a mind-reader!
I could go on, but the point here is that you’ve used a huge amount of information in that second when you look at the picture. Information about the 2D and 3D structure of the scene, visual elements like people’s identities, their actions, and even their thoughts. You think about the dynamics of the scene and guess what will happen next. All of these things come together for you to make sense of the scene.
It is incredible how human brains can unfold an image consisting of just arrays of R,G,B values. How about computers? How can we begin to write an algorithm that can reason about the scene like I just did above? How can we get the right data that can support the inferences we make?
The field of Computer Vision tackles this exact problem, and machine learning researchers have long focused on object recognition in particular. Several things make it hard to recognize objects: image segmentation, deformation, lighting, affordances, viewpoint, high dimensionality, etc. Computer Vision researchers use neural networks to solve complex object recognition problems by chaining together a lot of simple neurons. In a traditional feed-forward neural network, the images are fed into the net, and the neurons process them and output the likelihood that each image belongs to a given class. Sounds simple, doesn’t it?
But what if the images are deformed, just like the number digits above? Feed-forward neural nets only work well when the digit is right in the middle of the image, but fail spectacularly when the digit is slightly off position. In other words, the net knows only one pattern. This is clearly not useful in the real world, as real datasets are often dirty and unprocessed. As such, we need to improve our neural network to handle cases where the input images aren’t perfect.
Thankfully, Convolutional Neural Nets come to the rescue!
The Convolution Process
So what exactly is a Convolutional Neural Network? According to Chris Olah, Research Scientist at Google Brain:
“At its most basic, convolutional neural networks can be thought of as a kind of neural network that uses many identical copies of the same neuron. This allows the network to have lots of neurons and express computationally large models while keeping the number of actual parameters — the values describing how neurons behave — that need to be learned fairly small.”
Source: A 2D CNN - http://colah.github.io/posts/2014-07-Conv-Nets-Modular/
Note the term being used there: identical copies of the same neuron. This is loosely based on how the human brain works. By reusing the same learned concept, humans can spot and recognize patterns without having to re-learn them. For example, we recognize the identity of the digits above no matter where they appear in the image. The feed-forward neural network can’t do this. But a Convolutional Neural Net can because it exhibits translation invariance: it recognizes an object as the same object even when its position in the image changes.
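To make the "identical copies of the same neuron" idea concrete, here is a minimal sketch in plain Python. It is a toy 1D example of my own (the signals, the pattern, and the function names are all hypothetical), not code from Chris Olah's post. It compares a single dense neuron that has one weight per input position with a small 3-weight neuron copied across every position.

```python
def dense_neuron(signal, weights):
    """One feed-forward neuron: a separate weight for every input position."""
    return sum(w * x for w, x in zip(weights, signal))


def shared_neuron_responses(signal, kernel):
    """Apply identical copies of one small neuron (the kernel) at every position."""
    k = len(kernel)
    return [
        sum(kernel[j] * signal[i + j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]


# Hypothetical toy data: the same pattern at two different positions.
pattern = [1, 2, 1]
centered = [0, 0, 1, 2, 1, 0, 0]
shifted = [0, 0, 0, 0, 1, 2, 1]

# The dense neuron "memorizes" the pattern at the centered position only.
dense_weights = centered[:]
dense_centered = dense_neuron(centered, dense_weights)
dense_shifted = dense_neuron(shifted, dense_weights)

# The shared neuron slides over the signal, so its peak response moves with the pattern.
conv_centered = shared_neuron_responses(centered, pattern)
conv_shifted = shared_neuron_responses(shifted, pattern)

print(dense_centered, dense_shifted)            # 6 1
print(max(conv_centered), max(conv_shifted))    # 6 6
print(len(dense_weights), len(pattern))         # 7 weights vs 3 shared weights
```

Strictly speaking, the shared neuron alone only makes the response move along with the pattern. It is taking the maximum over positions (a crude form of pooling) that makes the final score the same for both inputs.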
The convolution process works like this:
- First, a CNN uses a sliding window search to break an image into overlapping image tiles.
- Then, the CNN feeds each image tile into a small neural network, using the same weights for each tile.
- The CNN then saves the results from each tile into a new output array.
- After that, the CNN down-samples the output array to reduce its size.
- Last but not least, after reducing a big image down into a small array, the CNN predicts whether the image is a match or not.
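The five steps above can be sketched in a few lines of plain Python. This is only a toy illustration under my own assumptions: the 5 x 5 image, the edge-detecting weights, the single-neuron "small network", and the final threshold are all hypothetical stand-ins, and nothing here is trained.

```python
def extract_tiles(image, size, stride):
    """Step 1: slide a size x size window over the image (tiles overlap when stride < size)."""
    rows, cols = len(image), len(image[0])
    tiles = []
    for r in range(0, rows - size + 1, stride):
        row_of_tiles = []
        for c in range(0, cols - size + 1, stride):
            row_of_tiles.append([image[r + i][c:c + size] for i in range(size)])
        tiles.append(row_of_tiles)
    return tiles


def tile_response(tile, weights, bias):
    """Step 2: the same small 'network' (a single neuron here) scores every tile."""
    total = bias
    for tile_row, weight_row in zip(tile, weights):
        total += sum(t * w for t, w in zip(tile_row, weight_row))
    return max(0.0, total)


def downsample(grid):
    """Step 4: keep the largest value in each non-overlapping 2 x 2 block."""
    return [
        [max(grid[r][c], grid[r][c + 1], grid[r + 1][c], grid[r + 1][c + 1])
         for c in range(0, len(grid[0]) - 1, 2)]
        for r in range(0, len(grid) - 1, 2)
    ]


# Hypothetical 5 x 5 grayscale image: dark on the left, bright on the right.
image = [[0, 0, 1, 1, 1] for _ in range(5)]
# Hypothetical shared weights that respond to a dark-to-bright vertical edge.
weights = [[-1, 1], [-1, 1]]

tiles = extract_tiles(image, size=2, stride=1)                 # step 1
output = [[tile_response(t, weights, bias=0.0) for t in row]   # steps 2 and 3
          for row in tiles]
pooled = downsample(output)                                    # step 4
is_match = max(max(row) for row in pooled) >= 1.0              # step 5 (toy threshold)

print(output[0])   # [0.0, 2.0, 0.0, 0.0]
print(pooled)      # [[2.0, 0.0], [2.0, 0.0]]
print(is_match)    # True
```

In a real CNN the weights are learned from data and the final decision comes from fully-connected layers rather than a hand-picked threshold, but the data flow is the same.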
There’s a fantastic tutorial by Adam Geitgey that goes into much more detail about how the convolution process works. I definitely suggest you check it out.
CNNs were popularized mostly thanks to the efforts of Yann LeCun, now the Director of AI Research at Facebook. In the early 1990s, LeCun worked at Bell Labs, one of the most prestigious research labs in the world at the time, and built a check-recognition system to read handwritten digits. There’s a very cool video from 1993 in which LeCun shows how the system works. This system was actually an entire process for doing end-to-end image recognition. The resulting paper, which he co-authored with Leon Bottou, Patrick Haffner, and Yoshua Bengio in 1998, introduces convolutional nets as well as the full end-to-end system they built. It’s quite a long paper, so I’ll summarize it quickly here. The first half describes convolutional nets, shows their implementation, and mentions everything else related to the technique (which I’ll cover in the CNN Architecture section below). The second half shows how to integrate convolutional nets with language models. For example, as you read a piece of English text, you can build a system on top of English grammar to extract the most likely interpretation that is part of the language. The big takeaway is that you can build a CNN system and train it to simultaneously do recognition and segmentation, and provide the right input for the language model.
Let’s discuss the architecture of a Convolutional Neural Network. There is an input image that we’re working with. We perform a series of convolution + pooling operations, followed by a number of fully connected layers. If we are performing multiclass classification, the output is softmax. There are four basic building blocks in every CNN: Convolution Layer, Non-Linearity (typically a ReLU activation applied after the convolution layer), Pooling Layer, and Fully-Connected Layer.
1 - Convolution Layer
Here we extract features from the input image:
- We preserve the spatial relationship between pixels by learning image features from small squares of input data. The small matrices of learned weights that we slide over these squares are called filters or kernels.
- The matrix formed by sliding the filter over the image and computing the dot product is called a Feature Map. The more filters we have, the more image features get extracted and the better our network becomes at recognizing patterns in unseen images.
- The size of our feature map is controlled by depth (the number of filters used), stride (the number of pixels slid over the input matrix), and zero-padding (padding the input matrix with 0s around the border).
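To see how depth, stride, and zero-padding interact, here is a small sketch using the standard output-size arithmetic for a square input: side = floor((n + 2p - f) / s) + 1, where n is the input size, f the filter size, s the stride, and p the padding. The function names and the example sizes are my own choices for illustration.

```python
def conv_output_size(input_size, filter_size, stride=1, padding=0):
    """Side length of the feature map for a square input and a square filter."""
    span = input_size + 2 * padding - filter_size
    if span < 0:
        raise ValueError("filter is larger than the padded input")
    return span // stride + 1


def zero_pad(image, padding):
    """Surround a 2D list with a border of zeros that is `padding` pixels wide."""
    width = len(image[0]) + 2 * padding
    padded = [[0] * width for _ in range(padding)]
    for row in image:
        padded.append([0] * padding + list(row) + [0] * padding)
    padded.extend([0] * width for _ in range(padding))
    return padded


def feature_map_shape(input_size, filter_size, num_filters, stride=1, padding=0):
    """(height, width, depth): depth is simply the number of filters."""
    side = conv_output_size(input_size, filter_size, stride, padding)
    return (side, side, num_filters)


print(conv_output_size(32, 5))                        # 28: no padding shrinks the map
print(conv_output_size(32, 5, padding=2))             # 32: padding preserves the size
print(conv_output_size(7, 3, stride=2))               # 3: a larger stride shrinks it faster
print(feature_map_shape(32, 5, num_filters=6))        # (28, 28, 6)
print(zero_pad([[1, 2], [3, 4]], 1))
# [[0, 0, 0, 0], [0, 1, 2, 0], [0, 3, 4, 0], [0, 0, 0, 0]]
```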
2 - Non-Linearity
For any kind of neural network to be powerful, it needs to contain non-linearity. LeNet uses sigmoid non-linearity, which takes a real-valued number and squashes it into a range between 0 and 1. In particular, large negative numbers become 0 and large positive numbers become 1. However, the sigmoid non-linearity has a couple of major drawbacks: (i) sigmoids saturate and kill gradients, (ii) sigmoids have slow convergence, and (iii) sigmoid outputs are not zero-centered.
A more powerful non-linear operation is ReLU, which stands for Rectified Linear Unit. It is an element-wise operation that replaces all negative values in the feature map with 0. We pass the result from the convolution layer through a ReLU activation function. Almost all CNN-based architectures developed later use ReLU, as in the case of AlexNet, which I discuss below.
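Here is a minimal sketch of the two non-linearities and their gradients, using only Python's standard library. The input values are arbitrary numbers I picked to show the saturation effect. The point is that the sigmoid's gradient almost vanishes for large inputs, while ReLU's gradient stays at 1 for any positive input.

```python
import math


def sigmoid(x):
    """Squash a real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x):
    """Derivative of the sigmoid: s * (1 - s), which peaks at 0.25 when x = 0."""
    s = sigmoid(x)
    return s * (1.0 - s)


def relu(x):
    """Replace negative values with 0, keep positive values unchanged."""
    return max(0.0, x)


def relu_grad(x):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if x > 0 else 0.0


def relu_map(feature_map):
    """Apply ReLU element-wise to a 2D feature map."""
    return [[relu(v) for v in row] for row in feature_map]


print(sigmoid(0.0), sigmoid_grad(0.0))     # 0.5 0.25
print(sigmoid_grad(10.0) < 1e-4)           # True: the sigmoid has saturated
print(relu_grad(10.0))                     # 1.0: the gradient passes through unchanged
print(relu_map([[-3.0, 2.0], [0.5, -1.0]]))  # [[0.0, 2.0], [0.5, 0.0]]
```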
3 - Pooling Layer
After this, we perform a pooling operation to reduce the dimensionality of each feature map. This enables us to reduce the number of parameters and computations in the network, therefore controlling overfitting.
A CNN typically uses max-pooling, in which it defines a spatial neighborhood (for example, a 2 x 2 window) and takes the largest element from the rectified feature map within that window. After the pooling layer, our network becomes invariant to small transformations, distortions, and translations in the input image.
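The invariance to small translations is easy to see in a toy sketch. Below, two hypothetical feature maps contain the same activations, nudged by one pixel, and 2 x 2 max-pooling produces the same output for both. Note that this only holds while each activation stays inside its original pooling window, so it is a local robustness rather than a guarantee.

```python
def max_pool(feature_map, window=2, stride=2):
    """Keep the largest value inside each window x window neighborhood."""
    rows, cols = len(feature_map), len(feature_map[0])
    pooled = []
    for r in range(0, rows - window + 1, stride):
        pooled_row = []
        for c in range(0, cols - window + 1, stride):
            pooled_row.append(max(
                feature_map[r + i][c + j]
                for i in range(window)
                for j in range(window)
            ))
        pooled.append(pooled_row)
    return pooled


# Hypothetical rectified feature map with two strong activations.
original = [
    [0, 9, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 5],
    [0, 0, 0, 0],
]
# The same activations, each nudged by one pixel within its 2 x 2 window.
nudged = [
    [0, 0, 0, 0],
    [9, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 5, 0],
]

print(max_pool(original))                       # [[9, 0], [0, 5]]
print(max_pool(original) == max_pool(nudged))   # True
```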
4 - Fully-Connected Layer
After the convolution and pooling layers, we add a couple of fully-connected layers to wrap up the CNN architecture. The output from the convolution and pooling layers represents high-level features of the input image. The FC layers use these features for classifying the input image into various classes based on the training dataset. Apart from classification, adding FC layers also helps to learn non-linear combinations of these features.
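As a rough sketch of this last stage, the code below flattens a pooled feature map, passes it through one fully-connected layer, and turns the class scores into probabilities with softmax (the multiclass output mentioned earlier). The feature values, weights, and the three-class setup are hypothetical and hand-picked; in a real network they would be learned.

```python
import math


def flatten(feature_maps):
    """Unroll a list of 2D feature maps into one flat feature vector."""
    return [v for fmap in feature_maps for row in fmap for v in row]


def fully_connected(inputs, weights, biases):
    """Every output neuron sees every input: one weighted sum per class."""
    return [
        sum(w * x for w, x in zip(neuron_weights, inputs)) + b
        for neuron_weights, b in zip(weights, biases)
    ]


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1 (shifted for numerical stability)."""
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical output of the last pooling layer: a single 2 x 2 feature map.
pooled = [[[1.0, 0.0], [0.0, 2.0]]]
# Hypothetical weights for three classes (one row of weights per class).
weights = [
    [1.0, 0.0, 0.0, 1.0],
    [0.0, 1.0, 1.0, 0.0],
    [-1.0, 0.0, 0.0, -1.0],
]
biases = [0.0, 0.0, 0.0]

features = flatten(pooled)
scores = fully_connected(features, weights, biases)
probs = softmax(scores)
predicted_class = probs.index(max(probs))

print(features)          # [1.0, 0.0, 0.0, 2.0]
print(scores)            # [3.0, 0.0, -3.0]
print(predicted_class)   # 0
```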
From a bigger picture, a CNN architecture accomplishes two major tasks: feature extraction (convolution + pooling layers) and classification (fully-connected layers). In general, the more convolution steps we have, the more complicated features our network will be able to learn to recognize.
CNNs have been remodeled in a variety of forms for different contexts of natural language processing, computer vision, and speech recognition. I will cover some notable industry applications later on in this post, but first, let’s discuss CNNs’ usage in computer vision. Recognizing real objects in color photographs downloaded from the web is much more complicated than recognizing handwritten digits. There are a hundred times as many classes, a hundred times as many pixels, two-dimensional images of three-dimensional scenes, cluttered scenes requiring segmentation, and multiple objects in each image. How will CNNs evolve to cope with these challenges?
In 2012, the Stanford University Computer Vision group organized the ILSVRC-2012 competition (ImageNet Large Scale Visual Recognition Challenge), one of the largest challenges in Computer Vision. It is based on ImageNet, a dataset with approximately 1.2 million high-resolution training images. Test images are presented with no initial annotation, and algorithms have to produce labelings specifying what objects are present in the images. Every year since then, teams from leading universities, startups, and big companies have competed to claim state-of-the-art performance on the dataset.
The winner of the 2012 competition, Alex Krizhevsky (NIPS 2012), built AlexNet, a very deep convolutional neural net of the type pioneered by Yann LeCun. Compared to LeNet, AlexNet is deeper, has more filters per layer, and is also equipped with stacked convolutional layers. Looking at AlexNet’s architecture below, you can identify the main differences between it and LeNet:
- The number of processing and trainable layers: AlexNet includes five convolutional layers, three max-pooling layers, and three fully-connected layers. LeNet only has two convolutional layers, two pooling (subsampling) layers, and three fully-connected layers.
- ReLU Non-Linearity: AlexNet uses ReLU whereas LeNet uses a logistic sigmoid. ReLU helps decrease training time for AlexNet, as networks with ReLUs train several times faster than equivalent networks with saturating units such as the logistic sigmoid.
- The use of dropout: AlexNet uses dropout layers to combat the problem of overfitting to the training data. LeNet doesn’t use such a concept.
- Diverse dataset: While LeNet was only trained to recognize handwritten digits, AlexNet was trained to work with the ImageNet data, which is much richer in terms of dimensions, colors, angles, semantics, etc.
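To give a feel for the dropout idea from the list above, here is a minimal sketch. I am assuming the common "inverted dropout" variant, which rescales the surviving activations during training. This is not necessarily the exact formulation used in AlexNet, and the function name and parameters are my own.

```python
import random


def dropout(activations, drop_prob, training=True, rng=random):
    """Randomly zero out activations during training (inverted dropout variant).

    Surviving activations are scaled by 1 / (1 - drop_prob) so that the expected
    value stays the same, which lets us leave activations untouched at test time.
    """
    if not 0.0 <= drop_prob < 1.0:
        raise ValueError("drop_prob must be in [0, 1)")
    if not training or drop_prob == 0.0:
        return list(activations)
    keep_prob = 1.0 - drop_prob
    return [a / keep_prob if rng.random() < keep_prob else 0.0 for a in activations]


# Hypothetical layer output: 1000 activations that are all 1.0.
rng = random.Random(0)  # seeded so the example is repeatable
train_out = dropout([1.0] * 1000, drop_prob=0.5, rng=rng)
test_out = dropout([1.0, 2.0], drop_prob=0.5, training=False)

kept = sum(1 for v in train_out if v != 0.0)
print(set(train_out) == {0.0, 2.0})   # True: each unit is either dropped or doubled
print(400 < kept < 600)               # True: roughly half of the units survive
print(test_out)                       # [1.0, 2.0]
```

Because a different random subset of neurons is silenced on every training step, no single neuron can rely too heavily on its neighbors, which is the intuition behind dropout reducing overfitting.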
AlexNet became the pioneering “deep” CNN that won the competition with 84.6% top-5 accuracy, while the second-place model (which relied on traditional hand-engineered features rather than deep architectures) achieved only a 73.8% accuracy rate.
Since then, this competition has become the benchmark arena where state-of-the-art computer vision models are introduced. In particular, there have been many competing models using deep Convolutional Neural Nets as their backbone architecture. The most popular ones that achieved excellent results in the ImageNet competition include: ZFNet (2013), GoogLeNet (2014), VGGNet (2014), ResNet (2015), DenseNet (2016), etc. These architectures were getting deeper and deeper year by year.
CNN architectures continue to feature prominently in Computer Vision, with architectural advancements providing improvements in speed, accuracy, and training for many of the applications and tasks mentioned below:
- In Object Detection, CNN is the major architecture behind the most popular models, such as R-CNN, Fast R-CNN, and Faster R-CNN. In these models, the net hypothesizes object regions and then classifies them, using the CNN on top of each of these region proposals. This is now the predominant pipeline for many object detection models, deployed in autonomous vehicles, smart video surveillance, facial detection, etc.
- In Object Tracking, CNNs have been used extensively in visual tracking applications. For example, given a CNN pre-trained on a large-scale image repository offline, an online visual tracking algorithm developed by the team at Pohang Institute in Korea can learn discriminative saliency maps to visualize a target spatially and locally. Another instance is DeepTrack, a solution to automatically relearn the most useful feature representations during the tracking process in order to accurately adapt to appearance changes, pose, and scale variations while preventing drift and tracking failures.
- In Object Recognition, the team from INRIA and MSR in France developed a weakly supervised CNN for object classification that relies only on image-level labels, yet can learn from cluttered scenes containing multiple objects. Another instance is FV-CNN, a texture descriptor developed by people from Oxford to solve the clutter problem in texture recognition.
- In Semantic Segmentation, Deep Parsing Network is a CNN-based net developed by a group of researchers from Hong Kong to incorporate rich information into an image segmentation process. UC Berkeley’s researchers, on the other hand, built fully-convolutional networks and exceeded the previous state of the art in semantic segmentation. More recently, SegNet was introduced as a deep fully convolutional neural network that is extremely efficient in terms of memory and computational time for semantic pixel-wise segmentation.
- In Video and Image Captioning, the most important invention has been UC Berkeley’s Long-Term Recurrent Convolutional Nets, which incorporate both CNNs and RNNs (Recurrent Neural Nets) to tackle large-scale visual understanding tasks including activity recognition, image captioning, and video descriptions. It has been deployed heavily by the Data Science team at YouTube to make sense of the huge number of videos uploaded to the platform daily.
CNNs have also found many novel applications outside of Vision, notably Natural Language Processing and Speech Recognition:
- Natural Language Processing: In the domain of Machine Translation, the AI Research team at Facebook used CNNs to achieve state-of-the-art accuracy at nine times the speed of recurrent neural systems. In the domain of Sentence Classification, Yoon Kim at NYU experimented with CNNs trained on top of pre-trained word vectors for sentence-level classification tasks and improved upon the state of the art on four out of seven tasks. In the domain of Question Answering, a few researchers from Waterloo and Maryland explored the effectiveness of CNNs for Answer Selection in end-to-end question answering. They found that answers selected by CNNs are measurably better than those from previous algorithms.
- Speech Recognition: CNNs are very effective models for reducing spectral variations and modeling spectral correlations in acoustic features for automatic speech recognition. Hybrid speech recognition systems incorporating CNNs with Hidden Markov Models/Gaussian Mixture Models have achieved state-of-the-art results in various benchmarks. Researchers at the University of Montreal have proposed an end-to-end speech framework for sequence labeling, by combining hierarchical CNNs with CTC (Connectionist Temporal Classification), that is competitive with existing baseline systems. The team at Microsoft used CNNs to reduce the error rate in speech recognition, in particular by building a CNN architecture with local connectivity, weight sharing, and pooling. Their model is robust to speaker and environment variations.
Let’s revisit our example of the Harry Potter image and see how I can use a CNN to recognize its features:
- First, I pass a sliding window over the entire original image and save each result as a separate, tiny picture tile. By doing this, I turn the original image into multiple equally-sized tiny image tiles.
- Then, I feed each image tile into the convolution layer and keep the same neural network weights for every single tile in the same original image.
- Next, I save the results from each tile into a new array in the same arrangement as the original image.
- Then, I use max-pooling to reduce the size of the array. For instance, I can look at each 2 x 2 square of the array and keep the biggest number.
- After being downsampled, the small array is then fed into the fully-connected layer to make predictions, say, whether it is an image of Harry, Ron, Hermione, the elves, the newspaper, the chair, etc.
- After training, I am now confident in making predictions for my image!
As you can see from this article, Convolutional Neural Networks have played an important part in shaping the history of deep learning. Heavily inspired by the study of the brain, CNNs have performed extremely well in commercial applications of deep learning (vision, language, speech) compared to most other neural networks. Many machine learning practitioners have used them to win academic and industry competitions. Research into CNN architectures is advancing at a rapid pace, with work on using fewer weights/parameters, automatically learning and generalizing features from the input objects, being invariant to object position and distortion in image/text/speech, etc. As one of the most popular neural network techniques, CNNs are a must-know for anyone who wants to enter the deep learning arena.