Tuesday, March 12, 2013

Neural Networks and Bias

Neural Networks (NN) haven't been sexy for a long time. I'm not talk about deep networks or autoencoders or any of the new very fancy NNs, but classic back projection trained feed forward neural networks. They are still very useful and powerful classifiers, and one of the first classifiers I ever implemented in JSAT. I also implemented it badly, and incorrectly. However, NNs are so powerful it was actually able to learn around the bugs in my code to still get decent results at times.

When I fist implemented them, I just assume that it took tons of neurons in the hidden layer to learn any function, even simple ones - because that was the results I had obtained. I also didn't know as much at the time about machine learning in general, how to think about the complexity of a decision boundary  how to interpret and investigate my results. So that not so great code has been sitting in JSAT for a while, and I've recently re written all the code from scratch. The new implementation is much cleaner and faster, and I figured I would go over a bit about how the weights are set up.

First is how Feed Forward networks are set up, as demonstrated by this beautiful diagram I have created.


This is a neural network with one input layer (the input is a vector of data points, the \(x\)), one hidden layer, and one output neuron. Every circle represents a "neuron". So this network would be predicting a regression problem. For classification problems, we tend to have \(C\) outputs, one for each output class (or optionally, \(C-1\) outputs - but I prefer encoding all C). There is one hidden layer with \(k\) hidden units, and they feed into the output layer. You'll notice the input and hidden layer have a "+1" unit, which is a unit that always returns positive one. For simplicity, I havent put in every connection. In general we make each neuron in one layer connect to each and every neuron in the other layer.

So, in this example, the input to \(h_1 = x_0 \cdot w_{1,0} + x_1 \cdot w_{1,1} + \cdots + x_n \cdot w_{1,n} + 1 \cdot w_{1,bias} = \vec{x} \cdot \vec{w_1} \)
Because each neuron goes through this processes, we can compute each value as a dot product. And this can be represented as a matrix vector operation \(\vec{h} = W \vec{x} \). But we dont want to manually add a bias term to everything, as that would take a lot of copying. Instead we move the bias weights out and use the "+1" bias term implicitly. So we end up \(\vec{h} = W \vec{x} + \vec{b}\) where \(\vec{b}\) contains the values of the bias weights. \(W\) would be a k by d matrix. One row for each of the output values, and one column for each of the input values. \(\vec{b}\) would also be of length k, since it needs to add one weight term to each output. The output \(\vec{h}\) would then have the activation function applied to each of its values before being feed into the next layer.

This bias term was in fact the biggest mistake I made when I first implemented the Neural Network code. Instead of storing the weight terms separately,  I created a new vector with a bias term added - and adjusted the weights accordingly. This added the bias term for the input layer to the first hidden layer, but neglected a biast term from the hidden layer to the output layer. What ended up happening is that I would have to add a lot of extra neurons so that they could collectively learn an even more complicated decision boundary that would stretch out from the origin of the world to wrap around the points that needed to be classified.

I wont go over the rest of the algorithm, as you can find many many descriptions of it - but I wanted to go over how the linear algebra is set up because so many examples just drop the bias terms, and it was enough to confuse me when I first looked at Neural Networks.

So to end, below is a couple of comparisons between the new and old code. Each example is learned with 10 neurons, using the same number of iterations and learning rate for each algorithm. In addition  I also show the results for the old algorithm with 40 neurons to show what went wrong when it looks different from the 10 neuron case. The new code didn't use any of the new features that the old one did not have.

New with 10 hidden neurons
Old with 10 hidden neurons
Old with 40 hidden neurons

No Difference
No Difference

As you can see, the bias makes a huge difference! I also didn't have this tool for visualizing 2 dimensional data sets. Despite the simplicity and naivety of 2D problems, it really allows your to see what is learned and confirm that your algorithm learns boundaries it should be able to handle with ease. You can see in a lot of the examples how the old code managed to learn something would result in a low training and test error, despite being a very bad decision boundary.

The new code is much cleaner and faster then the old code as well. It also includes the following new features:

  • Momentum
  • Weight Decay
  • More Activation Functions
  • Mini Batch Size
  • Different Weight Initializations
  • Learning Rate Decay
Which make a huge amount of difference in what you can learn and how quickly you can learn it. There is still more to add, including multicore support. However, it should all be much easier to add with the new code, including adding an sparse autoencoder on top of the same code.