# Artificial Intelligence: Learning to Model a Neural Network

About a year ago, I developed an interest in Artificial Intelligence and, with the guidance of my mentor, was introduced to some machine learning concepts: **Classification**, **Logistic Regression**, **K-Means Clustering**, and others. I decided to venture into **Artificial Neural Networks** by building a simple network that models the Pima Diabetes Dataset and estimates the chance that an individual is diabetic.

Yes, there are many powerful libraries that easily model and train neural networks, and there are also numerous articles that illustrate how to build a neural network. But after building from scratch (or with only Numpy), I have a better understanding of the basic concepts and techniques for modelling and training neural networks. It’s been a long and interesting journey, and I felt like sharing, hopefully adding to someone’s knowledge and brushing up my writing skills.

## Structure of the Neural Network

Neural Networks are a family of powerful machine learning algorithms that have been applied successfully in solving many problems, including the diagnosis of diabetes in an individual. A neural network has an **input layer** and an **output layer**, and most also have one or more **hidden layers** in between. The neural network described in this article is a multi-layer artificial neural network. It has 8 neurons in the input layer, and 1 neuron in the output layer. The 8 input neurons represent the 8 features of the dataset, while the output neuron indicates the class of an individual's record. I varied the number of hidden layers, and the number of neurons in each hidden layer; each network configuration affected the performance in its own way.

Training a neural network is made up of a few steps: **Initialising the Parameters**, **Forward Propagating the Activations**, **Evaluating the Cost Function**, **Back-Propagating the Error Derivatives** and **Updating the Parameters**. After initialisation, the remaining steps are carried out iteratively until a stopping criterion is reached. I've briefly discussed them below, together with some code snippets.

## Initialising the Parameters

Each connection between neurons has a **synaptic weight**, and the value of each weight is an indicator of its contribution to the final output. A neuron has an optional **bias**, used to control the ease with which it fires (or activates). It is by updating these parameters (synaptic weights and biases) that the neural network learns. For my network, I randomly initialised the weights and used zeros for the biases. However, the initialisation of these parameters is an interesting area of research.

**Code update**: Created a `Network` class; defined the `sigmoid` and `sigmoid_derivative` functions.

```
from numpy import exp, random, zeros

# Sigmoid (Logistic) Activation Function
def sigmoid(x):
    return 1 / (1 + exp(-x))

# Derivative of the Sigmoid Function
# Note: z is the sigmoid output, not the logit
def sigmoid_derivative(z):
    return z * (1 - z)

# Neural Network
class Network:
    def __init__(self, sizes):
        # Seed the random number generator so that runs are repeatable
        random.seed(1)
        # Number of features; number of neurons in the input layer
        self.no_features = sizes[0]
        # Initialise the biases to zeros
        self.biases = [zeros((a, 1)) for a in sizes[1:]]
        # Randomly initialise the weights; each matrix has shape (inputs, outputs)
        self.weights = [2 * random.randn(a, b) - 1 for a, b in zip(sizes[:-1], sizes[1:])]
```
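To make the shapes of these parameters concrete, here is a minimal sketch. The helper `init_parameters` and the layer sizes `[8, 4, 1]` are my own illustrative assumptions; the helper simply mirrors the constructor above so that it can run on its own.

```python
from numpy import random, zeros

def init_parameters(sizes, seed=1):
    """Hypothetical helper that mirrors the Network constructor."""
    random.seed(seed)
    # One bias column vector per non-input layer
    biases = [zeros((n, 1)) for n in sizes[1:]]
    # One weight matrix per pair of adjacent layers, shaped (inputs, outputs)
    weights = [2 * random.randn(a, b) - 1 for a, b in zip(sizes[:-1], sizes[1:])]
    return weights, biases

weights, biases = init_parameters([8, 4, 1])
weight_shapes = [w.shape for w in weights]
bias_shapes = [b.shape for b in biases]
print(weight_shapes)  # [(8, 4), (4, 1)]
print(bias_shapes)    # [(4, 1), (1, 1)]
```

A network with `n` layers therefore holds `n - 1` weight matrices and `n - 1` bias vectors.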

## Forward Propagating the Activations (Feed-Forward)

During Feed-forward, a record (from the dataset) is passed from the input layer, through the hidden layers, to the output layer. A number of things happen here:

- The value (*logit*) of a neuron is the **weighted-sum of the activations** of all neurons connected to it, plus the optional **bias** value. The input layer neurons are *already activated*; there’s no need for the weighted-sum arithmetic.
- The output of a neuron is determined by the activation function, which is applied to the **logit**, and the result is passed to the neurons in the next layer. This continues until the network output is obtained. (Note that applying the activation function is optional; some network architectures have no activation functions). I used the **sigmoid activation function** (**logistic function**); its output is between 0 and 1, and it has this special property: its **derivative** is a simple mathematical operation on its output. For my implementation, values of 0.5 and higher indicate a classification of 1 (diabetic), while lower values indicate a classification of 0 (not diabetic).
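The special property of the sigmoid can be checked numerically. The sketch below is only an illustration: the sample logits and the step size `h` are arbitrary choices of mine. It compares `z * (1 - z)`, computed from the sigmoid output, against a finite-difference estimate of the slope, and applies the 0.5 threshold described above.

```python
from numpy import allclose, exp, linspace

def sigmoid(x):
    return 1 / (1 + exp(-x))

def sigmoid_derivative(z):
    # z is the sigmoid output, not the logit
    return z * (1 - z)

logits = linspace(-5, 5, 11)   # arbitrary sample logits: -5, -4, ..., 5
outputs = sigmoid(logits)

# Central finite-difference estimate of the slope at each logit
h = 1e-5
numerical = (sigmoid(logits + h) - sigmoid(logits - h)) / (2 * h)
analytical = sigmoid_derivative(outputs)
derivatives_agree = allclose(numerical, analytical, atol=1e-8)
print(derivatives_agree)

# Outputs of 0.5 and higher are classified as 1, lower outputs as 0
predictions = (outputs >= 0.5).astype(int)
print(predictions.tolist())
```

Because the derivative only needs the output, the activations stored during feed-forward can be reused later without recomputing any exponentials.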

**Code update**: Imported the `reshape` function from Numpy, and added a `train` method to the `Network` class.

```
from numpy import exp, random, zeros, reshape

class Network:
    # ...
    def train(self, training_data, epochs=3000):
        for iter in range(epochs):
            for x, y in training_data:
                # Turn the record into a column vector of input activations
                activation = reshape(x, (self.no_features, 1))
                activations = [activation]
                # Forward Propagation (Feed-Forward)
                for we, bi in zip(self.weights, self.biases):
                    activation = sigmoid(we.T.dot(activation) + bi)
                    activations.append(activation)
```

## Evaluating the Cost Function

For each record in the Pima dataset, there is an associated output value called the **target value** (typical of Supervised Learning problems). This value is compared with the output of the neural network (the **actual value**). The cost function (often called the objective function, error function, or loss function) is a function of the target value and the actual value, and indicates how much the network output differs from the target. I used the **Mean Squared Error (MSE)** function for my implementation. Its derivative with respect to the network output, which is what back-propagation needs, is simply the difference between the two values (up to a constant factor). By minimising the cost function, the network output gets closer and closer to the target.

**Code update**: Defined the `mse_derivative` function, used to calculate the `error_derivative_wrt_output_activation`.

```
# ...
# Derivative of the Mean Squared Error Cost Function
def mse_derivative(output, target):
    return output - target

class Network:
    # ...
    def train(self, training_data, epochs=3000):
        for iter in range(epochs):
            for x, y in training_data:
                activation = reshape(x, (self.no_features, 1))
                activations = [activation]
                # Forward Propagation
                for we, bi in zip(self.weights, self.biases):
                    activation = sigmoid(we.T.dot(activation) + bi)
                    activations.append(activation)
                # Error Derivative
                error_derivative_wrt_output_activation = mse_derivative(activations[-1], y)
```
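The cost itself never appears in the code, only its derivative. As a sanity check, the sketch below assumes the per-record cost is half the squared difference (the `squared_error` function is my assumption, chosen because its derivative is exactly `output - target`) and compares `mse_derivative` with a finite-difference estimate.

```python
def squared_error(output, target):
    # Assumed per-record cost: half the squared difference
    return 0.5 * (output - target) ** 2

def mse_derivative(output, target):
    return output - target

output, target = 0.8, 1.0   # made-up values
h = 1e-6
numerical = (squared_error(output + h, target) - squared_error(output - h, target)) / (2 * h)
analytical = mse_derivative(output, target)
print(numerical, analytical)
```

Without the factor of one half, the derivative would be `2 * (output - target)`; the extra constant only rescales the learning rate.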

## Back-Propagating the Error Derivatives

As stated, the neural network learns by updating the parameters, thanks to the popular **Gradient Descent (GD)** algorithm. It uses the gradient of the cost function with respect to each parameter to move that parameter in the direction that reduces the cost. The algorithm can be visualised like a ball rolling down a hill. However, it requires the gradient with respect to each parameter to be computed and ready. The powerful **Back-propagation** algorithm is the answer here. It is an efficient algorithm that applies the chain rule backwards from the output layer, giving you the error derivatives needed to perform gradient descent.

**Code update**: Initialized the `delta`s; implemented back-propagation.

```
# ...
class Network:
    def __init__(self, sizes):
        # ...
        # The number of weight matrices
        # Basically the number_of_network_layers - 1
        self.btw_layers = len(sizes) - 1
        # Deltas (error_derivatives), with respect to the parameters
        # Initialized to zeros
        self.weight_deltas = [zeros(w.shape) for w in self.weights]
        self.bias_deltas = [zeros(b.shape) for b in self.biases]

    def train(self, training_data, epochs=3000):
        for iter in range(epochs):
            for x, y in training_data:
                # ...
                # Error Derivative
                error_derivative_wrt_output_activation = mse_derivative(activations[-1], y)
                delta = error_derivative_wrt_output_activation * sigmoid_derivative(activations[-1])
                # Back-Propagation (from the last layer to the first)
                for l in range(self.btw_layers):
                    self.bias_deltas[-l-1] = delta
                    self.weight_deltas[-l-1] = activations[-l-2].dot(delta.T)
                    delta = (self.weights[-l-1] * sigmoid_derivative(activations[-l-2])).dot(delta)
```
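A common way to gain confidence in a back-propagation implementation is a gradient check: nudge one weight, watch how the cost changes, and compare that slope with the back-propagated value. The sketch below is a standalone illustration rather than part of the network code. It reuses the same recurrences on a small `[2, 3, 1]` network, and the cost (half the squared error), the random seed, and the sample record are assumptions of mine.

```python
from numpy import allclose, array, exp, random, zeros

def sigmoid(x):
    return 1 / (1 + exp(-x))

def forward(weights, biases, x):
    activations = [x]
    for we, bi in zip(weights, biases):
        activations.append(sigmoid(we.T.dot(activations[-1]) + bi))
    return activations

def cost(weights, biases, x, y):
    # Assumed cost: half the squared error for a single record
    output = forward(weights, biases, x)[-1]
    return 0.5 * float(((output - y) ** 2).sum())

def backprop(weights, biases, x, y):
    # Same recurrences as the back-propagation loop above
    activations = forward(weights, biases, x)
    weight_deltas = [zeros(w.shape) for w in weights]
    bias_deltas = [zeros(b.shape) for b in biases]
    delta = (activations[-1] - y) * activations[-1] * (1 - activations[-1])
    for l in range(len(weights)):
        bias_deltas[-l-1] = delta
        weight_deltas[-l-1] = activations[-l-2].dot(delta.T)
        delta = (weights[-l-1] * activations[-l-2] * (1 - activations[-l-2])).dot(delta)
    return weight_deltas, bias_deltas

random.seed(0)
sizes = [2, 3, 1]
weights = [random.randn(a, b) for a, b in zip(sizes[:-1], sizes[1:])]
biases = [random.randn(b, 1) for b in sizes[1:]]
x = array([[0.3], [0.9]])   # made-up record
y = array([[1.0]])          # made-up target

weight_deltas, bias_deltas = backprop(weights, biases, x, y)

# Finite-difference estimate for every weight in the first layer
h = 1e-6
numerical = zeros(weights[0].shape)
for i in range(weights[0].shape[0]):
    for j in range(weights[0].shape[1]):
        original = weights[0][i, j]
        weights[0][i, j] = original + h
        cost_plus = cost(weights, biases, x, y)
        weights[0][i, j] = original - h
        cost_minus = cost(weights, biases, x, y)
        weights[0][i, j] = original
        numerical[i, j] = (cost_plus - cost_minus) / (2 * h)

gradients_match = allclose(numerical, weight_deltas[0], atol=1e-7)
print(gradients_match)
```

The finite-difference loop needs two forward passes per weight, while back-propagation produces every derivative from one forward and one backward pass.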

## Updating the Parameters

With the heavy computations done, updating the parameters (with gradient descent) became relatively easy. Each parameter is moved a small step against its error derivative, so some parameters increase and others decrease. The size of that step is controlled by the `learning_rate`.

**Code update**: Included the `learning_rate`, used to update parameters over the `batch_total`, using Gradient Descent.

```
class Network:
    # ...
    def train(self, training_data, learning_rate=5.0, epochs=3000):
        batch_total = len(training_data)
        for iter in range(epochs):
            for x, y in training_data:
                # ...
                # Forward Propagation
                # Error Derivative
                # Back-Propagation
                # ...
                # Update Parameters (Weights and Biases)
                # The step for each record is scaled by 1 / batch_total
                for l in range(self.btw_layers):
                    self.biases[l] = self.biases[l] - ((learning_rate / batch_total) * self.bias_deltas[l])
                    self.weights[l] = self.weights[l] - ((learning_rate / batch_total) * self.weight_deltas[l])
```

## Training the Network / Evaluating the Accuracy

How can one ensure the network is actually learning, and that updating parameters gradually drives the network output to the target? I split the dataset into two parts: Training data and Test data. (I didn’t cater for Validation data, not yet!). As a stopping criterion, I initially experimented with 3000 epochs, then different values later. During each epoch, training data activations were *fed-forward* from the input layer to the output layer, the error derivatives were computed and back-propagated, and the parameters were updated. After every 10 epochs, I performed only forward propagation with the test data, and compared the output with the target values to calculate the accuracy. Accuracy here refers to the number of records of the test data for which the actual output equals the target output.

**Code update**: Measured the training `start_time`; defined a method to `evaluate` the network's performance. Included `test_data` and `check_at` as arguments to the `train` method.

```
# ...
from time import time

class Network:
    # ...
    def evaluate(self, test_data):
        # Count the test records whose predicted class matches the target
        accuracy = 0.0
        for x_test, y_test in test_data:
            act = reshape(x_test, (self.no_features, 1))
            for l in range(self.btw_layers):
                act = sigmoid(self.weights[l].T.dot(act) + self.biases[l])
            # Outputs of 0.5 and higher are classified as 1, lower outputs as 0
            predicted = 1 if act.item() >= 0.5 else 0
            if predicted == y_test:
                accuracy += 1
        return accuracy

    def train(self, training_data, learning_rate=5.0, epochs=3000, check_at=10, test_data=None):
        start_time = time()
        for iter in range(epochs):
            # ...
            if iter % check_at == 0:
                if test_data is not None:
                    test_total = len(test_data)
                    accuracy = self.evaluate(test_data)
                    print("Accuracy: {0}/{1} => {2}%".format(accuracy, test_total, 100*(accuracy/test_total)))
                    print("After {0} epoch(s), in {1} seconds\n".format(iter, time() - start_time))
```
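The split and the accuracy count can be sketched on their own. Everything here is illustrative: `train_test_split` and `count_correct` are hypothetical helpers, the 25% test fraction is an arbitrary assumption, and the records and outputs are made-up numbers rather than Pima data.

```python
from numpy import arange, array, random

def train_test_split(records, test_fraction=0.25, seed=1):
    """Hypothetical helper: shuffle the rows, then hold out a fraction for testing."""
    rng = random.RandomState(seed)
    indices = rng.permutation(len(records))
    test_size = int(len(records) * test_fraction)
    return records[indices[test_size:]], records[indices[:test_size]]

def count_correct(outputs, targets, threshold=0.5):
    """Number of records whose thresholded output equals the target."""
    predictions = (outputs >= threshold).astype(int)
    return int((predictions == targets).sum())

records = arange(8 * 9).reshape(8, 9)        # 8 made-up rows: 8 features + 1 target
train_rows, test_rows = train_test_split(records)
print(train_rows.shape, test_rows.shape)     # (6, 9) (2, 9)

outputs = array([0.91, 0.40, 0.50, 0.08])    # made-up network outputs
targets = array([1, 1, 1, 0])
correct = count_correct(outputs, targets)
percentage = 100.0 * correct / len(targets)
print("{0}/{1} => {2}%".format(correct, len(targets), percentage))  # 3/4 => 75.0%
```

Keeping the test rows out of training is what makes the reported accuracy a measure of how well the network handles records it has not seen.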

# Testing with the Exclusive-OR (X-OR) problem

I first tried the network on the X-OR problem, using a small network with 2 input neurons, 1 hidden layer having 3 neurons, and 1 output neuron. The data for the X-OR problem is relatively small: just 4 rows, used as both training data and test data. (I evaluated after every 5 epochs.)

```
from numpy import array

neural_net = Network([2, 3, 1])
x_train = array([
    [0, 0],
    [0, 1],
    [1, 0],
    [1, 1]
])
y_train = array([
    [0],
    [1],
    [1],
    [0]
])
# Pair each record with its target; a list can be measured and re-used
training_data = list(zip(x_train, y_train))
test_data = training_data
neural_net.train(training_data, test_data=test_data, check_at=5)
```

In just over 0.08 seconds, after about 210 epochs, the network was able to correctly model the X-OR problem, with 100% accuracy.

# Testing with the Pima Diabetes Dataset

Then I tested with the Pima diabetes dataset. I stored the training data in a file named `train.csv`, and the test data in another file named `test.csv`. After extracting and preparing the data, I proceeded to train the network, only to face some challenges.

```
from numpy import genfromtxt
# ...
neural_net = Network([8, 20, 50, 100, 1])
# Extract from train/test files
train = genfromtxt('train.csv', delimiter=",")
test = genfromtxt('test.csv', delimiter=",")
# Separate the train/test features, from target values
x_train = train[:, 0:8]
y_train = train[:, [8]]
x_test = test[:, 0:8]
y_test = test[:, [8]]
# Prepare (zip) the train/test data as lists of (features, target) pairs
training_data = list(zip(x_train, y_train))
test_data = list(zip(x_test, y_test))
neural_net.train(training_data, test_data=test_data, check_at=100)
```

## Challenges/Requirements

**What you need to know**: As stated earlier, training a neural network can be straightforward, consisting of iterative steps. However, the details of these separate steps can be challenging and confusing. For instance, implementing the back-propagation algorithm was not so easy for me, especially with many hidden layers. Knowledge of Calculus and Numerical Analysis is always very helpful, although it's not a major requirement.

**Numpy (Python) Limitation**: With Numpy, it’s easy to perform complex numerical computations like calculating derivatives, especially involving multi-dimensional matrices. One major computation is that of the sigmoid function, which requires the calculation of the exponential, `exp`. Unfortunately, `exp` overflows for values higher than about `709` (`709.7827`) and underflows to zero for values less than about `-745` (`-745.133`), and with the dataset, I got values of about `881.52149758` and `-804.364266793` for the weighted-sum. Numpy produced this warning: `RuntimeWarning: overflow encountered in exp`. This is not specific to Numpy or Python; it is the range limit of the 64-bit floating-point numbers that both use. Try calculating the `exp` for `710` or `-746`, using Numpy or just Python.
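The limits are easy to reproduce. The sketch below also shows one widely used way of writing the sigmoid so that `exp` is never called on a large positive number. This `stable_sigmoid` is an alternative formulation that I am adding for illustration; it is not what my network used.

```python
from numpy import array, errstate, exp, isinf, zeros_like

def stable_sigmoid(x):
    """Sigmoid that only ever calls exp on non-positive numbers."""
    x = array(x, dtype=float)
    result = zeros_like(x)
    positive = x >= 0
    # For x >= 0, exp(-x) is at most 1, so it cannot overflow
    result[positive] = 1 / (1 + exp(-x[positive]))
    # For x < 0, use the equivalent form exp(x) / (1 + exp(x))
    exp_x = exp(x[~positive])
    result[~positive] = exp_x / (1 + exp_x)
    return result

# Silence the overflow warning just for this demonstration
with errstate(over='ignore'):
    overflowed = exp(array([710.0]))    # inf: too large for a 64-bit float
underflowed = exp(array([-746.0]))      # 0.0: too small to represent

values = stable_sigmoid(array([881.52149758, -804.364266793, 0.0]))
print(overflowed, underflowed, values)
```

Avoiding the warning does not make logits this large harmless, so a rewrite like this complements scaling the inputs rather than replacing it.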

This limitation affected the performance of my network; with logits that large the sigmoid saturates at 0 or 1, where its derivative is 0, so the network was not learning at all. When I tried another Python library (Bigfloat), I was able to get the `exp` for some larger numbers. But I ran into another challenge, with manipulating numpy arrays. I eventually discovered a way out: Feature Scaling.

**Feature Scaling**: This helped present the values of the data within a particular range, while retaining some inherent relationships between them. I used Min-Max Normalisation, to have a fixed range of values, from 0 to 1. There are however other techniques for feature scaling, each one with its advantages.

**Code update**: Defined the `min_max_norm` function to perform feature scaling on both `training_data` and `test_data`. Also imported `array` from `numpy`.

```
from numpy import genfromtxt, array
# ...
def min_max_norm(x):
    # Rescale the values of x to the range 0 to 1
    x_min = x.min()
    return (x - x_min) / (x.max() - x_min)

# ...
# Apply min-max feature scaling to each feature (column) separately
x_train = array([min_max_norm(x) for x in x_train.T]).T
x_test = array([min_max_norm(x) for x in x_test.T]).T
# ...
neural_net.train(training_data, test_data=test_data, check_at=100)
```
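Here is a small worked example of what the scaling does, using made-up numbers for two feature columns. It also shows an equivalent vectorised form using Numpy's `axis` argument, which is a stylistic alternative of mine and not the code I ran.

```python
from numpy import allclose, array

def min_max_norm(x):
    x_min = x.min()
    return (x - x_min) / (x.max() - x_min)

# Made-up values: three records, two features on very different scales
features = array([
    [1.0, 100.0],
    [3.0, 150.0],
    [5.0, 200.0],
])

# Column-by-column scaling, as in the snippet above
scaled = array([min_max_norm(column) for column in features.T]).T
print(scaled.tolist())  # [[0.0, 0.0], [0.5, 0.5], [1.0, 1.0]]

# Equivalent vectorised form: per-column minimum and maximum
col_min = features.min(axis=0)
col_max = features.max(axis=0)
scaled_vectorised = (features - col_min) / (col_max - col_min)
same_result = allclose(scaled, scaled_vectorised)
print(same_result)
```

A common variation, which I did not use, is to compute `col_min` and `col_max` from the training data only and reuse them for the test data, so both sets are scaled identically.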

With feature scaling, I could then train the network, and test with my test data. The image below shows the network performance over the training data (green) and the test data (blue). *This and subsequent graphs were plotted with the matplotlib library for Python.*

It is clear from the image that some work still needs to be done. The network accuracy (on the training data) got to the range of ~90% only after about 1500 epochs. For the test data though, accuracy was still pretty low.

## Improving the Performance of the Network

There are some concepts that help improve the performance of a neural network including Stochastic Gradient Descent, Regularisation Techniques, Momentum and Velocity, Weight Initialisation etc. Additionally, some network parameters like the learning rate, regularisation parameter, momentum coefficient, number of hidden layers, number of neurons in each hidden layer, can be tuned to achieve even greater results.

**Stochastic Gradient Descent (SGD)**: This is a modification of the gradient descent algorithm. Instead of training the network with the entire training data at once, the data is (randomly) divided into mini-batches and the network is trained with the mini-batches. The size of a mini-batch (`mini_batch_size`) is another tunable parameter, passed to the `train` method as an argument. In my experiments, SGD reached a high accuracy in fewer epochs than normal GD (batch gradient descent), as visualised in the image below.

**Code update**: Added a tunable `mini_batch_size` parameter to the `train` method. Performed training over each mini-batch.

```
# ...
class Network:
    # ...
    def train(self, training_data, epochs=3000, learning_rate=5.0, mini_batch_size=None, check_at=10, test_data=None):
        total_training_data = len(training_data)
        # Without a mini_batch_size, use the whole training data as one batch
        if mini_batch_size is None:
            mini_batch_size = total_training_data
        # Shuffle the training_data
        random.shuffle(training_data)
        # Split the data into mini-batches
        mini_batches = [training_data[k:k+mini_batch_size] for k in range(0, total_training_data, mini_batch_size)]
        start_time = time()
        for iter in range(epochs):
            for mini_batch in mini_batches:
                for x, y in mini_batch:
                    activation = reshape(x, (self.no_features, 1))
                    activations = [activation]
                    # ...

# ...
neural_net.train(training_data, test_data=test_data, check_at=100, epochs=1000, learning_rate=5.0, mini_batch_size=10)
```

It is obvious that the network performance improved for the training data. About 95% accuracy was achieved even before 500 epochs. This illustrates how SGD can improve the performance of a neural network vs normal (Batch) Gradient Descent. However, there wasn't much improvement with the test data.
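The shuffling and slicing step can be tried on its own. In this sketch, the helper name `make_mini_batches` is hypothetical and the 25 integers merely stand in for 25 `(x, y)` records.

```python
from numpy import random

def make_mini_batches(training_data, mini_batch_size):
    """Hypothetical helper: slice a list into consecutive mini-batches."""
    return [training_data[k:k + mini_batch_size]
            for k in range(0, len(training_data), mini_batch_size)]

random.seed(1)
training_data = list(range(25))    # stand-in for 25 (x, y) records
random.shuffle(training_data)      # shuffles the list in place

mini_batches = make_mini_batches(training_data, 10)
batch_sizes = [len(batch) for batch in mini_batches]
print(batch_sizes)  # [10, 10, 5]
```

The last mini-batch is smaller whenever the data does not divide evenly, which is why it is safer to take the batch size from `len(mini_batch)` than from `mini_batch_size`.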

**Overfitting & Regularisation**: From the images shown so far, the network performed better with the training data than with the test data. This is an indication that **over-fitting** is affecting the performance of the network, as the network does not generalise to new and unseen records. Over-fitting is one of the challenges with modelling and training neural networks. Different regularisation techniques are available to help combat over-fitting. I attempted to implement the L2-Regularisation technique; the image below shows the result. By tuning a `regularisation_parameter` passed to the `train` method, I was able to further modify the performance of the network.

**Code update**: Added a `regularisation_parameter`, applied only to the `weight_deltas`.

```
# ...
class Network:
    # ...
    def train(self, training_data, epochs=3000, learning_rate=5.0, regularisation_parameter=0.0, mini_batch_size=None, check_at=10, test_data=None):
        # ...
        for iter in range(epochs):
            for mini_batch in mini_batches:
                batch_total = len(mini_batch)
                for x, y in mini_batch:
                    # ...
                    # Update Parameters (Weights and Biases)
                    # The weights are first shrunk by the regularisation factor
                    for l in range(self.btw_layers):
                        self.biases[l] = self.biases[l] - ((learning_rate / batch_total) * self.bias_deltas[l])
                        self.weights[l] = ((1 - (learning_rate * (regularisation_parameter/batch_total))) * self.weights[l]) - ((learning_rate / batch_total) * self.weight_deltas[l])

# ...
neural_net.train(training_data, test_data=test_data, regularisation_parameter=0.000725, check_at=10, epochs=epochs, learning_rate=5.0, mini_batch_size=10)
```

Notice that the green line is below the blue line. This is because with regularisation, we are more concerned with obtaining a generalised model that similarly represents both the training and test data. We care less about fitting all the records of the training data. A perfect model would fit all records of both datasets, attaining 100% accuracy.
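To see why multiplying the weights by a factor slightly below 1 counts as L2 regularisation, the sketch below compares the update used in `train` with plain gradient descent on the cost plus a penalty of `(regularisation_parameter / (2 * batch_total)) * sum(w**2)`. The penalty form is the usual textbook one and is my assumption here; the weights and gradients are made-up numbers.

```python
from numpy import allclose, array

learning_rate = 5.0
regularisation_parameter = 0.000725
batch_total = 10

weights = array([[0.5, -1.2], [2.0, 0.1]])            # made-up weights
weight_deltas = array([[0.02, -0.01], [0.00, 0.03]])  # made-up gradients

# Update as written in the train method: shrink the weights, then step
decayed = ((1 - learning_rate * (regularisation_parameter / batch_total)) * weights
           - (learning_rate / batch_total) * weight_deltas)

# Same step written as gradient descent on cost + L2 penalty
penalty_gradient = (regularisation_parameter / batch_total) * weights
explicit = weights - learning_rate * (weight_deltas / batch_total + penalty_gradient)

same_update = allclose(decayed, explicit)
shrink_factor = 1 - learning_rate * regularisation_parameter / batch_total
print(same_update, shrink_factor)
```

With these settings each update shrinks every weight by roughly 0.036% before the gradient step, which gently discourages large weights. The biases are left out of the penalty, matching the code above.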

**Learning Rate**: The rate at which the network learns is tuned by this factor, `learning_rate`. A very small learning rate means that the network learns very slowly, gradually approaching the optimum point. However, with a very large learning rate, we can experience over-shooting and its undesirable effects. The best value for the learning rate comes after experimenting with many values. I used values of `0.01`, `0.5`, `3.0`, `5.0`, `7.0`, before making a final decision of `5.0`. Small positive values for the learning rate are advised and there are some techniques that help indicate an effective learning rate.
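The effect of the learning rate is easiest to see on a toy problem. The sketch below minimises the one-parameter cost `w**2` (an assumption chosen for illustration, unrelated to the network's real cost surface) with three learning rates. Note that the rates used here are not comparable to the ones I tried above, because my update divides the learning rate by the batch size.

```python
def gradient_descent(learning_rate, steps=20, start=1.0):
    """Minimise the toy cost w**2, whose optimum is at w = 0."""
    w = start
    for _ in range(steps):
        gradient = 2 * w                  # derivative of w**2
        w = w - learning_rate * gradient
    return w

small = gradient_descent(0.001)    # barely moves away from the start
moderate = gradient_descent(0.1)   # gets close to the optimum
large = gradient_descent(1.1)      # over-shoots further on every step
print(small, moderate, large)
```

Each step multiplies `w` by `1 - 2 * learning_rate`, so on this toy cost any rate above 1 makes the parameter bounce away from the optimum instead of settling on it.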

# Further Learning/Research

Now, I find myself at an interesting position. I feel good with what I've learnt and implemented - Neural Networks, Back-Propagation, Feature Scaling, Stochastic Gradient Descent, Regularisation (L2), etc. I also feel overwhelmed, by the numerous concepts yet to be learnt.

- First, the Cross-Entropy Cost Function is another cost function, with some advantages over the MSE cost function, and is worth implementing.
- Additionally, I discovered that my network was suffering from **Vanishing/Exploding Gradient** because of the nature of the sigmoid activation function. Implementing the ReLU activation function should reduce the effect of the problem, not really eliminate it.
- Also, as noticed, I randomly initialised the weights, but this has some negative effects. **Xavier Initialisation** is the more preferred initialisation technique, and is definitely worth learning, understanding and implementing.
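As a starting point for those three items, here is a rough sketch of what each one looks like in Numpy. None of this is in my network yet. The function names are hypothetical, and the Xavier variant shown is the normal-distribution form with variance `2 / (fan_in + fan_out)`.

```python
from numpy import array, log, maximum, random, sqrt

def cross_entropy(output, target):
    # Binary cross-entropy cost for a single sigmoid output
    return -(target * log(output) + (1 - target) * log(1 - output))

def relu(x):
    # Rectified Linear Unit: negative logits become 0, the rest pass through
    return maximum(0, x)

def xavier_weights(fan_in, fan_out):
    # Scale the random weights by the sizes of the two layers they connect
    return random.randn(fan_in, fan_out) * sqrt(2.0 / (fan_in + fan_out))

random.seed(1)
loss = cross_entropy(0.5, 1.0)                # equals log(2), about 0.693
relu_out = relu(array([-2.0, 0.0, 3.0]))
w = xavier_weights(8, 20)
print(loss, relu_out.tolist(), w.shape)
```

One appeal of pairing cross-entropy with a sigmoid output is that the output-layer delta simplifies to `output - target`, without the extra `sigmoid_derivative` factor that shrinks the gradient when the output saturates.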

These are just some of the many concepts to understand for anyone interested in Artificial Intelligence and Machine Learning. I hope to share as I learn, perhaps in subsequent posts. For some resources and books, Andrew Ng's Machine Learning course is good for some basic ML understanding. Michael Nielsen's online book is good to start coding with Python, and Deep Learning is very rich. Rohan Kapur and co have really good articles, and there are millions of articles, books, videos, etc, available on the internet.

The full network code is saved to this Github gist.