# A Gentle Introduction to Neural Networks for Machine Learning

**Why Do We Need Machine Learning?**

We need machine learning for tasks that are too complex for humans to code directly, i.e. tasks that are so complex that it is impractical, if not impossible, for us to work out all of the nuances and code for them explicitly. So instead, we provide a machine learning algorithm with a large amount of data and let it explore and search for a model that will work out what the programmers have set out to achieve.

Let’s look at these two examples:

- It’s very hard to write programs that solve problems like recognizing a 3D object, from a novel viewpoint, in new lighting conditions, in a cluttered scene. We don’t know what program to write because we don’t know how it’s done in our brain. Even if we had a good idea for how to do it, the program might be horrendously complicated.
- It’s hard to write a program to compute the probability that a credit card transaction is fraudulent. There may not be any rules that are both simple and reliable, so we need to combine a very large number of weak rules. Fraud is also a moving target, so the program needs to keep changing.

Then comes the **Machine Learning Approach**: instead of writing a program by hand for each specific task, we collect lots of examples that specify the correct output for a given input. A machine learning algorithm then takes these examples and produces a program that does the job. The program produced by the learning algorithm may look very different from a typical hand-written program — it may contain millions of numbers. If we do it right, the program works for new cases, as well as the ones we trained it on. If the data changes, the program can change too by training from the new data. You should note that conducting massive amounts of computation is now cheaper than paying someone to write a task-specific program.

Some examples of tasks best solved by machine learning include:

- Recognizing patterns: objects in real scenes, facial identities or facial expressions, and/or spoken words
- Recognizing anomalies: unusual sequences of credit card transactions, unusual patterns of sensor readings in a nuclear power plant
- Prediction: future stock prices or currency exchange rates, which movies a person will like

**What are Neural Networks?**

**Neural Networks** are a class of models within the general machine learning literature. They are a specific set of algorithms that have revolutionized machine learning. They are inspired by biological neural networks, and the current so-called deep neural networks have proven to work quite well. Neural networks are themselves general function approximators, which is why they can be applied to almost any machine learning problem that involves learning a complex mapping from the input to the output space.

Here are the three reasons you should study neural computation:

- To understand how the brain actually works: it’s very big and very complicated and made of stuff that dies when you poke it, so we need to use computer simulations.
- To understand a style of parallel computation inspired by neurons and their adaptive connections: it’s a very different style from sequential computation.
- To solve practical problems by using novel learning algorithms inspired by the brain: learning algorithms can be very useful even if they are not how the brain actually works.

**Top 10 Neural Network Architectures You Need to Know**

**1 - Perceptrons**

Considered the first generation of neural networks, *perceptrons* are simply computational models of a single neuron. The term was coined by Frank Rosenblatt in “The perceptron: a probabilistic model for information storage and organization in the brain” ^{[1]}. Also called a *feed-forward neural network*, a perceptron feeds information from the front to the back. Training perceptrons usually requires *back-propagation*, giving the network paired datasets of inputs and outputs. Inputs are sent into the neuron, processed, and result in an output. The error that is back-propagated is usually the difference between the output the network produces and the target output. If the network has enough hidden neurons, it can in principle model the relationship between the input and output. Practically, their use is a lot more limited, but they are popularly combined with other networks to form new networks.

If you choose features by hand and have enough of them, you can do almost anything. For binary input vectors, we can have a separate feature unit for each of the exponentially many binary vectors, which lets us make any possible discrimination on binary input vectors. However, perceptrons do have limitations: once the hand-coded features have been determined, there are very strong limits on what a perceptron can learn.
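To make the limitation concrete, here is a minimal sketch of a single binary threshold unit trained with the classic perceptron learning rule. The function names, the learning rate, and the AND/XOR toy data are assumptions chosen for illustration; they do not come from Rosenblatt's paper.

```python
def predict(weights, bias, features):
    """Binary threshold unit: output 1 if the weighted sum exceeds 0, else 0."""
    total = bias + sum(w * x for w, x in zip(weights, features))
    return 1 if total > 0 else 0


def train_perceptron(samples, epochs=20, lr=1.0):
    """Perceptron learning rule on (features, label) pairs with 0/1 labels."""
    n_features = len(samples[0][0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        for features, label in samples:
            error = label - predict(weights, bias, features)  # -1, 0, or +1
            if error != 0:
                # Nudge the weights toward (or away from) the misclassified input.
                weights = [w + lr * error * x for w, x in zip(weights, features)]
                bias += lr * error
    return weights, bias


# Logical AND is linearly separable, so the rule settles on a correct boundary.
and_data = [([0, 0], 0), ([0, 1], 0), ([1, 0], 0), ([1, 1], 1)]
and_weights, and_bias = train_perceptron(and_data)
and_outputs = [predict(and_weights, and_bias, x) for x, _ in and_data]

# XOR is not linearly separable: no single threshold unit on the raw inputs fits it.
xor_data = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 0)]
xor_weights, xor_bias = train_perceptron(xor_data)
xor_outputs = [predict(xor_weights, xor_bias, x) for x, _ in xor_data]
```

With the raw inputs as the only features, the unit learns AND but cannot reproduce XOR however long it trains. Adding a hand-coded feature such as the product of the two inputs would make XOR learnable, which is the sense in which the choice of features does most of the work.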

**2 - Convolutional Neural Networks**

In 1998, Yann LeCun and his collaborators developed a really good recognizer for handwritten digits called LeNet. It used back-propagation in a feedforward net with many hidden layers, many maps of replicated units in each layer, output pooling of nearby replicated units, a wide net that can cope with several characters at once, even if they overlap, and a clever way of training a complete system, not just a recognizer. It was later formalized under the name ***convolutional neural networks (CNNs)***.

Convolutional neural networks are quite different from most other networks. They are primarily used for image processing, but can also be used for other types of input, such as audio. A typical use case for CNNs is where you feed the network images and it classifies the data. CNNs tend to start with an input “scanner,” which is not intended to parse all of the training data at once. For example, to input an image of 100 x 100 pixels, you wouldn’t want a layer with 10,000 nodes. Rather, you create a scanning input layer of, say, 10 x 10, and you feed it the first 10 x 10 pixels of the image. Once you’ve passed that input, you feed it the next 10 x 10 pixels by moving the scanner one pixel to the right.

This input data is then fed through convolutional layers instead of normal layers, where not all nodes are connected. Each node only concerns itself with close neighboring cells. These convolutional layers also tend to shrink as they become deeper, mostly by easily divisible factors of the input. Besides these convolutional layers, CNNs also often feature *pooling layers*. Pooling is a way to filter out details: a commonly found pooling technique is *max pooling*, where we take, say, 2 x 2 pixels and pass on only the pixel with the largest value (for example, the most red). If you want to dig deeper into CNNs, read Yann LeCun’s original paper, “Gradient-based learning applied to document recognition” (1998) ^{[2]}.
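The scanning and pooling steps described above can be sketched in a few lines of plain Python. This is an illustrative toy, not LeNet: the 5 x 5 image, the 2 x 2 edge-detecting kernel, and the function names are all assumptions made for the example.

```python
def convolve2d(image, kernel):
    """Slide a square kernel over the image one pixel at a time (no padding)."""
    k = len(kernel)
    out_h = len(image) - k + 1
    out_w = len(image[0]) - k + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j] for i in range(k) for j in range(k))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def max_pool(feature_map, size=2):
    """Non-overlapping max pooling: keep the largest value in each size x size block."""
    rows = len(feature_map) // size
    cols = len(feature_map[0]) // size
    return [
        [
            max(feature_map[r * size + i][c * size + j]
                for i in range(size) for j in range(size))
            for c in range(cols)
        ]
        for r in range(rows)
    ]


# A 5 x 5 image that is dark on the left and bright on the right.
image = [[0, 0, 1, 1, 1] for _ in range(5)]
# A hypothetical kernel that responds to dark-to-bright vertical edges.
kernel = [[-1, 1], [-1, 1]]

feature_map = convolve2d(image, kernel)  # 4 x 4 map, non-zero only at the edge
pooled = max_pool(feature_map)           # 2 x 2 summary of the map
```

Each output value depends only on a small neighborhood of the input, and the same kernel weights are reused at every position, which is why the layer needs far fewer parameters than a fully connected one.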

**3 - Recurrent Neural Networks**

To understand RNNs, we need to have a brief overview of *sequence modeling*. When applying machine learning to sequences, we often want to turn an input sequence into an output sequence that lives in a different domain. For example, turn a sequence of sound pressures into a sequence of word identities. When there is no separate target sequence, we can get a teaching signal by trying to predict the next term in the input sequence. The target output sequence is the input sequence with an advance of one step. This seems much more natural than trying to predict one pixel in an image from the other pixels, or one patch of an image from the rest of the image. Predicting the next term in a sequence blurs the distinction between supervised and unsupervised learning. It uses methods designed for supervised learning but doesn’t require a separate teaching signal.
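As a small illustration of "the input sequence with an advance of one step," the sketch below builds training pairs from a single sequence. The function name and the character-level example are assumptions made for the illustration.

```python
def next_term_pairs(sequence):
    """Pair every term with the term that follows it."""
    inputs = sequence[:-1]   # everything except the last term
    targets = sequence[1:]   # the same sequence shifted by one step
    return list(zip(inputs, targets))


pairs = next_term_pairs(list("hello"))
# pairs == [('h', 'e'), ('e', 'l'), ('l', 'l'), ('l', 'o')]
```

No labels were supplied by hand: the targets come from the data itself, yet the resulting pairs can be fed to any supervised learner.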

*Memoryless models* are the standard approach to this task. In particular, autoregressive models can predict the next term in a sequence from a fixed number of previous terms using “delay taps.” Feed-forward neural nets are generalized autoregressive models that use one or more layers of non-linear hidden units. However, if we give our generative model some hidden state, and if we give this hidden state its own internal dynamics, we get a much more interesting kind of model that can store information in its hidden state for a long time. If the dynamics and the way it generates outputs from its hidden state are noisy, we will never know its exact hidden state. The best we can do is infer a probability distribution over the space of hidden state vectors. This inference is only tractable for two types of hidden state models.
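A linear autoregressive model with a fixed number of delay taps can be sketched as a weighted sum of the most recent terms. The coefficients below are picked by hand for a toy series rather than learned, and the function name is hypothetical.

```python
def autoregressive_predict(history, coefficients):
    """Predict the next term from the last len(coefficients) terms.

    Coefficients are ordered oldest tap first.
    """
    taps = history[-len(coefficients):]
    return sum(c * x for c, x in zip(coefficients, taps))


# A straight-line series obeys x[t] = 2 * x[t-1] - x[t-2].
series = [1.0, 3.0, 5.0, 7.0]
next_value = autoregressive_predict(series, coefficients=[-1.0, 2.0])  # 9.0
```

The model looks only at the last two values and keeps nothing else from the past, which is what makes it memoryless.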

Originally introduced in Jeffrey Elman's “Finding structure in time” (1990) ^{[3]}, *recurrent neural networks (RNNs)* are basically perceptrons. However, unlike perceptrons, which are stateless, they have connections between passes, connections through time. RNNs are very powerful, because they combine two properties: 1) a distributed hidden state that allows them to store a lot of information about the past efficiently and 2) non-linear dynamics that allow them to update their hidden state in complicated ways. With enough neurons and time, RNNs can compute anything that your computer can compute. So what kinds of behavior can RNNs exhibit? They can oscillate, settle to point attractors, and behave chaotically. They can potentially learn to implement lots of small programs that each capture a nugget of knowledge and run in parallel, interacting to produce very complicated effects.
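The "connections through time" can be sketched as a hidden state that is fed back in at every step. This is a minimal Elman-style forward pass on plain lists, assuming a tanh non-linearity; the weight names (`w_xh`, `w_hh`, `b_h`) and the toy values are illustrative and nothing is being trained.

```python
import math


def rnn_step(x, h_prev, w_xh, w_hh, b_h):
    """One update: h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h)."""
    h_new = []
    for i in range(len(h_prev)):
        total = b_h[i]
        total += sum(w_xh[i][j] * x[j] for j in range(len(x)))
        total += sum(w_hh[i][j] * h_prev[j] for j in range(len(h_prev)))
        h_new.append(math.tanh(total))
    return h_new


def run_rnn(inputs, h0, w_xh, w_hh, b_h):
    """Feed a sequence through the network, carrying the hidden state along."""
    h = h0
    states = []
    for x in inputs:
        h = rnn_step(x, h, w_xh, w_hh, b_h)
        states.append(h)
    return states


# One input unit and one hidden unit, with hand-picked weights.
w_xh, w_hh, b_h = [[1.0]], [[0.5]], [0.0]
states_a = run_rnn([[1.0], [0.0]], [0.0], w_xh, w_hh, b_h)
states_b = run_rnn([[-1.0], [0.0]], [0.0], w_xh, w_hh, b_h)
```

Both sequences end with the same input, yet the final hidden states differ in sign because the state still carries a trace of the first input. A stateless feed-forward net would produce identical outputs for the two final inputs.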

One big problem with RNNs is the vanishing (or exploding) gradient problem, where, depending on the activation functions used, information rapidly gets lost over time. Intuitively, this wouldn’t be much of a problem because these are just weights and not neuron states, but the weights through time are actually where the information from the past is stored. If a weight reaches a value of 0 or 1,000,000, the previous state won’t be very informative. RNNs can, in principle, be used in many fields, as most forms of data that don’t actually have a timeline (unlike audio or video) can be represented as a sequence. A picture or a string of text can be fed one pixel or character at a time, so time-dependent weights are used for what came before in the sequence, not actually what happened x seconds before. In general, recurrent networks are a good choice for advancing or completing information, like autocompletion.
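To see why the gradient can vanish, consider a deliberately simplified scalar recurrence, h_t = tanh(w * h_{t-1}). This is an assumption made for illustration: a real RNN has weight matrices and inputs at every step. By the chain rule, the sensitivity of the final state to the initial state is a product of one factor per time step.

```python
import math


def gradient_through_time(w, steps, h0=0.5):
    """Return d h_T / d h_0 for the scalar recurrence h_t = tanh(w * h_{t-1})."""
    h, grad = h0, 1.0
    for _ in range(steps):
        h = math.tanh(w * h)
        grad *= w * (1.0 - h * h)  # derivative of tanh(w * h_prev) w.r.t. h_prev
    return grad


short_range = gradient_through_time(0.5, steps=1)
long_range = gradient_through_time(0.5, steps=20)
```

With w = 0.5 every factor is at most 0.5, so after 20 steps the product is below 0.5 to the power 20, roughly one in a million. An error signal sent back that far barely changes the weights, which is how information about the distant past gets lost.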

**4 - Long / Short Term Memory**

Hochreiter & Schmidhuber (1997) ^{[4]} solved the problem of getting an RNN to remember things for a long time by building what are known as ***long short-term memory networks (LSTMs)***. LSTMs try to combat the vanishing/exploding gradient problem by introducing gates and an explicitly defined memory cell. The memory cell stores previous values and holds onto them unless a "forget gate" tells the cell to forget those values. LSTMs also have an "input gate" that adds new information to the cell and an "output gate" that decides when to pass along the vectors from the cell to the next hidden state.

Recall that with all RNNs, the values coming in from X_train and H_previous are used to determine what happens in the current hidden state. The results of the current hidden state (H_current) are used to determine what happens in the next hidden state. LSTMs simply add a cell layer to make sure the transfer of hidden state information from one iteration to the next is reasonably high. Put another way, we want to remember stuff from previous iterations for as long as needed, and the cells in LSTMs allow this to happen. LSTMs are able to learn complex sequences, such as Hemingway’s writing or Mozart’s music.
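The gates and the memory cell can be sketched as one step of a commonly used LSTM formulation. The parameter names (`W_f`, `b_f`, and so on), the list-based helpers, and the toy biases are assumptions for illustration, and the exact equations differ slightly between papers and libraries.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(weights, bias, vector):
    """Matrix-vector product plus bias, on plain lists."""
    return [b + sum(w * v for w, v in zip(row, vector)) for row, b in zip(weights, bias)]


def lstm_step(x, h_prev, c_prev, params):
    """One LSTM step: returns the new hidden state and the new cell state."""
    concat = h_prev + x  # previous hidden state followed by the current input
    forget = [sigmoid(v) for v in affine(params["W_f"], params["b_f"], concat)]
    write = [sigmoid(v) for v in affine(params["W_i"], params["b_i"], concat)]
    read = [sigmoid(v) for v in affine(params["W_o"], params["b_o"], concat)]
    candidate = [math.tanh(v) for v in affine(params["W_g"], params["b_g"], concat)]
    # The cell keeps a gated share of its old value and adds a gated share of new content.
    c = [f * c_old + i * g for f, c_old, i, g in zip(forget, c_prev, write, candidate)]
    h = [o * math.tanh(c_k) for o, c_k in zip(read, c)]
    return h, c


# One hidden unit, one input. All weights are zero, so only the biases act:
# the forget gate is held open (about 1) and the input gate is held shut (about 0).
params = {
    "W_f": [[0.0, 0.0]], "b_f": [20.0],
    "W_i": [[0.0, 0.0]], "b_i": [-20.0],
    "W_o": [[0.0, 0.0]], "b_o": [0.0],
    "W_g": [[0.0, 0.0]], "b_g": [0.0],
}
h, c = [0.0], [0.7]
for _ in range(50):
    h, c = lstm_step([1.0], h, c, params)
```

After 50 steps the cell still holds a value very close to 0.7. Because the cell is updated additively rather than being squashed through a non-linearity at every step, the stored value (and the gradient flowing through it) can survive for a long time.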

**5 - Gated Recurrent Unit**

*Gated recurrent units (GRUs)* are a slight variation on LSTMs. They take X_train and H_previous as inputs. They perform some calculations and then pass along H_current. In the next iteration, X_train.next and H_current are used for more calculations, and so on. What makes them different from LSTMs is that GRUs don't need the cell layer to pass values along. The calculations within each iteration ensure that the H_current values being passed along either retain a high amount of old information or are jump-started with a high amount of new information.
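Here is a rough sketch of one GRU step, showing how an update gate blends old and new information without a separate cell. The parameter names and toy values are assumptions, and the convention used here (new state = (1 - z) * old + z * candidate) is only one of the conventions found in papers and libraries.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(weights, bias, vector):
    """Matrix-vector product plus bias, on plain lists."""
    return [b + sum(w * v for w, v in zip(row, vector)) for row, b in zip(weights, bias)]


def gru_step(x, h_prev, params):
    """One GRU step: returns H_current given X_train and H_previous."""
    concat = h_prev + x
    update = [sigmoid(v) for v in affine(params["W_z"], params["b_z"], concat)]
    reset = [sigmoid(v) for v in affine(params["W_r"], params["b_r"], concat)]
    # The reset gate decides how much of the old state feeds the candidate.
    gated = [r * h for r, h in zip(reset, h_prev)] + x
    candidate = [math.tanh(v) for v in affine(params["W_h"], params["b_h"], gated)]
    # The update gate interpolates between the old state and the candidate.
    return [(1 - z) * h + z * c for z, h, c in zip(update, h_prev, candidate)]


def toy_params(update_bias):
    """Hypothetical one-unit parameters: zero weights, so only the biases matter."""
    return {
        "W_z": [[0.0, 0.0]], "b_z": [update_bias],
        "W_r": [[0.0, 0.0]], "b_r": [0.0],
        "W_h": [[0.0, 0.0]], "b_h": [1.0],
    }


kept = gru_step([0.3], [0.8], toy_params(-20.0))     # update gate shut: old state kept
replaced = gru_step([0.3], [0.8], toy_params(20.0))  # update gate open: candidate wins
```

With the update gate shut, the state stays at about 0.8. With it open, the state jumps to the candidate value tanh(1), about 0.76. These are the two extremes of "retain old information" and "jump-start with new information."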

In most cases, GRUs function very similarly to LSTMs, with the biggest difference being that GRUs are slightly faster and easier to run (but also slightly less expressive). In practice, these tend to cancel each other out, as you need a bigger network to regain some expressiveness, which then in turn cancels out the performance benefits. In some cases where the extra expressiveness is not needed, GRUs can outperform LSTMs. You can read more about GRUs in Junyoung Chung’s 2014 “Empirical evaluation of gated recurrent neural networks on sequence modeling” ^{[5]}.

**6 - Hopfield Network**

Recurrent networks of non-linear units are generally very hard to analyze. They can behave in many different ways: settle to a stable state, oscillate, or follow chaotic trajectories that cannot be predicted far into the future. To resolve this problem, John Hopfield introduced the Hopfield Net in his 1982 work “Neural networks and physical systems with emergent collective computational abilities” ^{[6]}. A *Hopfield network (HN)* is a network where every neuron is connected to every other neuron. It is a completely entangled plate of spaghetti, as every node functions as everything: each node serves as an input before training, is hidden during training, and serves as an output afterwards. The networks are trained by setting the value of the neurons to the desired pattern, after which the weights can be computed. The weights do not change after this. Once trained for one or more patterns, the network will always converge to one of the learned patterns because the network is only stable in those states.
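The "set the neurons to the pattern, then compute the weights" procedure can be sketched with the standard Hebbian storage rule for units that take the values +1 and -1. The function names and the eight-unit toy pattern are assumptions made for the example.

```python
def train_hopfield(patterns):
    """Hebbian rule: w_ij is the sum over patterns of s_i * s_j, no self-connections."""
    n = len(patterns[0])
    weights = [[0] * n for _ in range(n)]
    for p in patterns:
        for i in range(n):
            for j in range(n):
                if i != j:
                    weights[i][j] += p[i] * p[j]
    return weights


def energy(weights, state):
    """Energy of a global state; stored patterns sit at low-energy points."""
    n = len(state)
    return -0.5 * sum(weights[i][j] * state[i] * state[j]
                      for i in range(n) for j in range(n))


def recall(weights, state, sweeps=5):
    """Update one unit at a time until the state settles."""
    state = list(state)
    for _ in range(sweeps):
        for i in range(len(state)):
            field = sum(weights[i][j] * state[j] for j in range(len(state)))
            state[i] = 1 if field >= 0 else -1
    return state


pattern = [1, -1, 1, -1, 1, -1, 1, -1]
weights = train_hopfield([pattern])

corrupted = [-1] + pattern[1:]  # flip the first unit
restored = recall(weights, corrupted)
```

The corrupted state has higher energy than the stored pattern (-14 versus -28 here), and the unit-by-unit updates slide it downhill until the original pattern is recovered.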

There is another computational role for Hopfield nets. Instead of using the net to store memories, we use it to construct interpretations of sensory input. The input is represented by the visible units, the interpretation is represented by the states of the hidden units, and the badness of the interpretation is represented by the energy.

Unfortunately, people have shown that a Hopfield net is very limited in its capacity. A Hopfield net of N units can only memorize about 0.15N patterns because of the so-called *spurious minima* in its energy function. The idea is that since the energy function is continuous in the space of its weights, if two local minima are too close, they might “fall” into each other to create a single local minimum that doesn’t correspond to any training sample, while forgetting about the two samples it is supposed to memorize. This phenomenon significantly limits the number of samples that a Hopfield net can learn.

**7 - Boltzmann Machine**

A *Boltzmann Machine* is a type of stochastic recurrent neural network. It can be seen as the stochastic, generative counterpart of Hopfield nets. It was one of the first neural networks capable of learning internal representations and able to represent and solve difficult combinatoric problems. First introduced by Geoffrey Hinton and Terrence Sejnowski in “Learning and relearning in Boltzmann machines” (1986) ^{[7]}, Boltzmann machines are a lot like Hopfield Networks, but some neurons are marked as input neurons and others remain “hidden.” The input neurons become output neurons at the end of a full network update. It starts with random weights and learns through back-propagation. Compared to a Hopfield Net, the neurons mostly have binary activation patterns.

The goal of learning for a Boltzmann machine learning algorithm is to maximize the product of the probabilities that the Boltzmann machine assigns to the binary vectors in the training set. This is equivalent to maximizing the sum of the log probabilities that the Boltzmann machine assigns to the training vectors. It is also equivalent to maximizing the probability that we would obtain exactly the N training cases if we did the following: 1) let the network settle to its stationary distribution N different times with no external input and 2) sample the visible vector once each time.

An efficient mini-batch learning procedure was proposed for Boltzmann Machines by Salakhutdinov and Hinton in 2012 ^{[8]}.

- For the positive phase, first initialize the hidden probabilities at 0.5, clamp a data vector on the visible units, then update all of the hidden units in parallel until convergence using mean field updates. After the net has converged, record PiPj for every connected pair of units and average this over all data in the mini-batch.
- For the negative phase: first keep a set of “fantasy particles.” Each particle has a value that is a global configuration. Sequentially update all of the units in each fantasy particle a few times. For every connected pair of units, average SiSj over all of the fantasy particles.
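The two phases above produce two sets of pairwise statistics, and in the standard Boltzmann learning rule each weight moves in proportion to their difference. The sketch below shows only that final step. It assumes the mean-field probabilities and the fantasy-particle states have already been computed, it treats every pair of units as connected, and its function names and learning rate are hypothetical.

```python
def average_products(vectors):
    """Average v_i * v_j over a list of equal-length vectors, for every pair (i, j)."""
    n = len(vectors[0])
    return [
        [sum(v[i] * v[j] for v in vectors) / len(vectors) for j in range(n)]
        for i in range(n)
    ]


def weight_update(data_probabilities, fantasy_states, learning_rate=0.1):
    """Delta w_ij = learning_rate * (<p_i p_j> over data - <s_i s_j> over fantasies)."""
    positive = average_products(data_probabilities)  # positive-phase statistics
    negative = average_products(fantasy_states)      # negative-phase statistics
    n = len(positive)
    return [
        [learning_rate * (positive[i][j] - negative[i][j]) for j in range(n)]
        for i in range(n)
    ]


# Two units; made-up statistics for a mini-batch of two cases and two particles.
data_probabilities = [[1.0, 1.0], [1.0, 0.0]]
fantasy_states = [[1, 0], [0, 1]]
delta = weight_update(data_probabilities, fantasy_states)
```

In this toy case the two units are on together more often under the data than in the model's own fantasies, so the weight between them is increased, making that co-activation more probable.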

In a general Boltzmann machine, the stochastic updates of units need to be sequential. There is a special architecture that allows alternating parallel updates, which are much more efficient: no connections within a layer and no skip-layer connections. This architecture is called a Deep Boltzmann Machine (DBM), a general Boltzmann machine with a lot of missing connections. With it, the mini-batch procedure makes the updates of the Boltzmann machine more parallel.

**8 - Deep Belief Networks**

Back-propagation is considered the standard method in artificial neural networks for calculating the error contribution of each neuron after a batch of data is processed. However, there are some major problems with using back-propagation. First, it requires labeled training data, while almost all data is unlabeled. Second, the learning time does not scale well, which means it is very slow in networks with multiple hidden layers. Third, it can get stuck in poor local optima, so for deep nets the result can be far from optimal.

To overcome the limitations of back-propagation, researchers have considered using unsupervised learning approaches. This helps keep the efficiency and simplicity of using a gradient method for adjusting the weights, while also using it to model the structure of the sensory input. In particular, they adjust the weights to maximize the probability that a generative model would have generated the sensory input. The question is what kind of generative model we should learn. Can it be an energy-based model like a Boltzmann machine? Or a causal model made of idealized neurons? Or a hybrid of the two?

In “Greedy layer-wise training of deep networks” ^{[9]}, Yoshua Bengio and his co-authors showed that *Deep Belief Networks* can be trained effectively stack by stack. This technique is also known as greedy training, where greedy means making locally optimal choices to get to a decent but possibly not optimal answer. A belief net is a directed acyclic graph composed of stochastic variables. Using a belief net, we get to observe some of the variables, and we would like to solve two problems: 1) the inference problem: infer the states of the unobserved variables, and 2) the learning problem: adjust the interactions among variables to make the network more likely to generate the training data.

Deep Belief Networks can be trained through contrastive divergence or back-propagation and learn to represent the data as a probabilistic model. Once trained or converged to a stable state through unsupervised learning, the model can be used to generate new data. If trained with contrastive divergence, it can even classify existing data because the neurons have been taught to look for different features.

**9 - Autoencoders**

**Autoencoders** are neural networks designed for unsupervised learning, i.e. when the data is unlabeled. As data-compression models, they can be used to encode a given input into a representation of smaller dimension. A decoder can then be used to reconstruct the input from the encoded version.

The work they do is very similar to *Principal Component Analysis*, which is generally used to represent a given input using fewer dimensions than originally present. So, for example, in NLP, if you represent a word as a vector of 100 numbers, you could use PCA to represent it in 10 numbers. Of course, that would result in loss of some information, but it is a good way to represent your input if you can only work with a limited number of dimensions. Also, it is a good way to visualize the data because you can easily plot the reduced dimensions on a 2D graph, as opposed to a 100-dimensional vector. Autoencoders do similar work. The difference is that they can use non-linear transformations to encode the given vector into smaller dimensions (compared to PCA, which is a linear transformation), so they can generate more complex encodings.
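To make the encode-then-reconstruct idea concrete, here is a tiny sketch of an autoencoder that squeezes 2-D points into a single number through a tanh encoder and expands them again with a linear decoder. Everything in it is an assumption chosen for illustration: the toy data, the starting weights, the learning rate, and the hand-derived gradients.

```python
import math

# Toy data lying on the line x2 = 2 * x1, so one number can describe each point.
data = [(0.5, 1.0), (-0.5, -1.0), (0.25, 0.5), (-0.25, -0.5)]


def encode(x, w_enc):
    """Non-linear encoder: compress a 2-D input into a single code value."""
    return math.tanh(w_enc[0] * x[0] + w_enc[1] * x[1])


def decode(code, w_dec):
    """Linear decoder: expand the code back into 2-D."""
    return (w_dec[0] * code, w_dec[1] * code)


def reconstruction_loss(w_enc, w_dec):
    """Mean squared reconstruction error over the toy data."""
    total = 0.0
    for x in data:
        x_hat = decode(encode(x, w_enc), w_dec)
        total += (x_hat[0] - x[0]) ** 2 + (x_hat[1] - x[1]) ** 2
    return total / len(data)


def train(w_enc, w_dec, lr=0.1, epochs=200):
    """Full-batch gradient descent, with gradients obtained by the chain rule."""
    w_enc, w_dec = list(w_enc), list(w_dec)
    for _ in range(epochs):
        g_enc, g_dec = [0.0, 0.0], [0.0, 0.0]
        for x in data:
            code = encode(x, w_enc)
            err = [w_dec[k] * code - x[k] for k in range(2)]
            d_code = sum(2 * err[k] * w_dec[k] for k in range(2))
            for k in range(2):
                g_dec[k] += 2 * err[k] * code / len(data)
                g_enc[k] += d_code * (1 - code ** 2) * x[k] / len(data)
        w_enc = [w - lr * g for w, g in zip(w_enc, g_enc)]
        w_dec = [w - lr * g for w, g in zip(w_dec, g_dec)]
    return w_enc, w_dec


initial = ([0.5, 0.5], [0.5, 0.5])
loss_before = reconstruction_loss(*initial)
loss_after = reconstruction_loss(*train(*initial))
```

The targets are the inputs themselves, so no labels are needed. Training should pull the reconstruction error well below its starting value, and the single code value then acts as a learned one-dimensional representation of each point.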

They can be used for dimension reduction, pre-training of other neural networks, data generation, etc. They are attractive for a few reasons: (1) they provide flexible mappings both ways, (2) the learning time is linear (or better) in the number of training cases, and (3) the final encoding model is fairly compact and fast. However, it turns out that it’s very difficult to optimize deep autoencoders using back-propagation. With small initial weights, the back-propagated gradient dies. Nowadays, they are rarely used in practical applications, mostly because in the key areas where they were once considered breakthroughs (such as layer-wise pre-training), vanilla supervised learning works better. Check out the original paper by Bourlard and Kamp, dated 1988 ^{[10]}.

**10 - Generative Adversarial Network**

In “Generative adversarial nets” (2014) ^{[11]}, Ian Goodfellow introduced a new breed of neural network, in which two networks work together. *Generative Adversarial Networks (GANs)* consist of any two networks (although often a combination of feed-forward and convolutional neural nets), with one tasked to generate content (generative) and the other to judge content (discriminative). The discriminative model has the task of determining whether a given image looks natural (an image from the dataset) or artificially created. The generator’s task is to create natural-looking images that are similar to the original data distribution. This can be thought of as a zero-sum or minimax two-player game. The analogy used in the paper is that the generative model is like “a team of counterfeiters, trying to produce and use fake currency” while the discriminative model is like “the police, trying to detect the counterfeit currency.” The generator is trying to fool the discriminator while the discriminator is trying to not get fooled by the generator. As the models train through alternating optimization, both models improve until the point where the “counterfeits are indistinguishable from the genuine articles.”
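The minimax game can be written down as two losses computed from the discriminator's outputs. The sketch below assumes those outputs are already available as probabilities that a sample is real; the function names are hypothetical and no networks are actually trained here.

```python
import math


def discriminator_loss(d_real, d_fake):
    """The discriminator wants D(x) near 1 on real data and D(G(z)) near 0 on fakes."""
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return -(real_term + fake_term)


def generator_loss(d_fake):
    """Minimax form: the generator wants the discriminator to call its fakes real."""
    return sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)


# A discriminator that can no longer tell real from fake outputs 0.5 everywhere.
undecided = discriminator_loss([0.5, 0.5], [0.5, 0.5])

# The generator's loss is lower when its samples fool the discriminator.
fooled = generator_loss([0.9, 0.9])
caught = generator_loss([0.1, 0.1])
```

When the discriminator is reduced to guessing, its loss equals log 4, about 1.386. That is the point at which the counterfeits have become indistinguishable from the genuine articles. In alternating optimization, one step lowers `discriminator_loss` with the generator fixed, and the next lowers `generator_loss` with the discriminator fixed.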

According to Yann LeCun, these networks could be the next big development. They are one of the few successful techniques in unsupervised machine learning, and are quickly revolutionizing our ability to perform generative tasks. Over the last few years, we’ve come across some very impressive results. There is a lot of active research in the field to apply GANs to language tasks, improve their stability and ease of training, and so on. They are already being applied in industry for a variety of applications, ranging from interactive image editing, 3D shape estimation, drug discovery, and semi-supervised learning to robotics.

## Conclusion

Neural networks are one of the most beautiful programming paradigms ever invented. In the conventional approach to programming, we tell the computer what to do and break big problems up into many small, precisely defined tasks that the computer can easily perform. In contrast, with a neural network we don’t tell the computer how to solve our problem. Instead, it learns from observational data and figures out its own solution to the problem.

Today, deep neural networks and deep learning achieve outstanding performance for many important problems in computer vision, speech recognition, and natural language processing. They’re being deployed on a large scale by companies such as Google, Microsoft, and Facebook. I hope that this post helps you learn the core concepts of neural networks, including modern techniques for deep learning.

**Additional Readings**

- A Visual and Interactive Guide to the Basics of Neural Networks
- Yes you should understand backprop
- ConvNets: A Modular Perspective
- Understanding Convolutions
- The Unreasonable Effectiveness of Recurrent Neural Networks
- Understanding LSTM Networks
- Generative Models

**Paper References**

[1] Rosenblatt, Frank. “The perceptron: a probabilistic model for information storage and organization in the brain.” Psychological Review 65.6 (1958): 386.

[2] LeCun, Yann, et al. “Gradient-based learning applied to document recognition.” Proceedings of the IEEE 86.11 (1998): 2278-2324.

[3] Elman, Jeffrey L. “Finding structure in time.” Cognitive Science 14.2 (1990): 179-211.

[4] Hochreiter, Sepp, and Jürgen Schmidhuber. “Long short-term memory.” Neural Computation 9.8 (1997): 1735-1780.

[5] Chung, Junyoung, et al. “Empirical evaluation of gated recurrent neural networks on sequence modeling.” arXiv preprint arXiv:1412.3555 (2014).

[6] Hopfield, John J. “Neural networks and physical systems with emergent collective computational abilities.” Proceedings of the National Academy of Sciences 79.8 (1982): 2554-2558.

[7] Hinton, Geoffrey E., and Terrence J. Sejnowski. “Learning and relearning in Boltzmann machines.” Parallel Distributed Processing: Explorations in the Microstructure of Cognition 1 (1986): 282-317.

[8] Salakhutdinov, Ruslan R., and Geoffrey E. Hinton. “Deep Boltzmann Machines.” Proceedings of the 20th International Conference on AI and Statistics, Vol. 5, pp. 448-455, Clearwater Beach, Florida, USA, 16-18 Apr 2009. PMLR.

[9] Bengio, Yoshua, et al. “Greedy layer-wise training of deep networks.” Advances in Neural Information Processing Systems 19 (2007): 153.

[10] Bourlard, Hervé, and Yves Kamp. “Auto-association by multilayer perceptrons and singular value decomposition.” Biological Cybernetics 59.4-5 (1988): 291-294.

[11] Goodfellow, Ian, et al. “Generative adversarial nets.” Advances in Neural Information Processing Systems. 2014.