When I fist implemented them, I just assume that it took tons of neurons in the hidden layer to learn any function, even simple ones - because that was the results I had obtained. I also didn't know as much at the time about machine learning in general, how to think about the complexity of a decision boundary how to interpret and investigate my results. So that not so great code has been sitting in JSAT for a while, and I've recently re written all the code from scratch. The new implementation is much cleaner and faster, and I figured I would go over a bit about how the weights are set up.
First is how Feed Forward networks are set up, as demonstrated by this beautiful diagram I have created.
This is a neural network with one input layer (the input is a vector of data points, the \(x\)), one hidden layer, and one output neuron. Every circle represents a "neuron". So this network would be predicting a regression problem. For classification problems, we tend to have \(C\) outputs, one for each output class (or optionally, \(C-1\) outputs - but I prefer encoding all C). There is one hidden layer with \(k\) hidden units, and they feed into the output layer. You'll notice the input and hidden layer have a "+1" unit, which is a unit that always returns positive one. For simplicity, I havent put in every connection. In general we make each neuron in one layer connect to each and every neuron in the other layer.
So, in this example, the input to \(h_1 = x_0 \cdot w_{1,0} + x_1 \cdot w_{1,1} + \cdots + x_n \cdot w_{1,n} + 1 \cdot w_{1,bias} = \vec{x} \cdot \vec{w_1} \)
Because each neuron goes through this processes, we can compute each value as a dot product. And this can be represented as a matrix vector operation \(\vec{h} = W \vec{x} \). But we dont want to manually add a bias term to everything, as that would take a lot of copying. Instead we move the bias weights out and use the "+1" bias term implicitly. So we end up \(\vec{h} = W \vec{x} + \vec{b}\) where \(\vec{b}\) contains the values of the bias weights. \(W\) would be a k by d matrix. One row for each of the output values, and one column for each of the input values. \(\vec{b}\) would also be of length k, since it needs to add one weight term to each output. The output \(\vec{h}\) would then have the activation function applied to each of its values before being feed into the next layer.
This bias term was in fact the biggest mistake I made when I first implemented the Neural Network code. Instead of storing the weight terms separately, I created a new vector with a bias term added - and adjusted the weights accordingly. This added the bias term for the input layer to the first hidden layer, but neglected a biast term from the hidden layer to the output layer. What ended up happening is that I would have to add a lot of extra neurons so that they could collectively learn an even more complicated decision boundary that would stretch out from the origin of the world to wrap around the points that needed to be classified.
I wont go over the rest of the algorithm, as you can find many many descriptions of it - but I wanted to go over how the linear algebra is set up because so many examples just drop the bias terms, and it was enough to confuse me when I first looked at Neural Networks.
So to end, below is a couple of comparisons between the new and old code. Each example is learned with 10 neurons, using the same number of iterations and learning rate for each algorithm. In addition I also show the results for the old algorithm with 40 neurons to show what went wrong when it looks different from the 10 neuron case. The new code didn't use any of the new features that the old one did not have.
New with 10 hidden neurons
|
Old with 10 hidden neurons
|
Old with 40 hidden neurons
|
|
|
|
No Difference
|
||
No Difference
|
The new code is much cleaner and faster then the old code as well. It also includes the following new features:
- Momentum
- Weight Decay
- More Activation Functions
- Mini Batch Size
- Different Weight Initializations
- Learning Rate Decay
Which make a huge amount of difference in what you can learn and how quickly you can learn it. There is still more to add, including multicore support. However, it should all be much easier to add with the new code, including adding an sparse autoencoder on top of the same code.