Saturday, November 24, 2012

Bias, Variance & Ensembles: Some Examples

In my last post, I talked a bit about bias, variance, and their importance in Ensemble methods. In this post I'll provide some visual examples that will hopefully better illustrate the concept. To do this, I'll be using Decision Stumps and Decision Trees.

A Decision Stump is just a Decision Tree with only one level, which means a stump can only select one attribute to split on. To keep the comparison apples to apples, the only setting that will change when I compare two Stumps or Trees is how splitting is done; the gain method, depth, pruning method, and everything else will be the same.
  1. C4.5 style splitting. Numeric attributes will be split by creating only a single binary divide that maximizes the gain. 
  2. My PDF splitting method. Numeric attributes can split an arbitrary number of times. Splitting is done by approximating the PDF of the attribute for each class using a KDE, and splitting at the intersections of the PDFs. All splits for the same class map to one path, so multiple value ranges can map to a single path. Thus, the splits map every data point to the class that had the maximum probability in a given range when considering only one attribute. (Mine is a bit more complicated!)
The first method (C45) has a higher bias, since it can only do one split along the axis, and a lower variance.
The second method (MyPDF) has a lower bias, since it can adapt to the fluctuations in the PDF of the attribute in question. This increases the variance, since even if the PDFs were truly equal, random fluctuations may cause intersections that make us split for no good reason.
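
For concreteness, here is a minimal sketch of what the first splitting rule does on a single numeric attribute. This is plain NumPy and not the implementation used to generate the figures below: it simply tries every boundary between adjacent sorted values and keeps the threshold with the highest information gain.

```python
import numpy as np

def entropy(y):
    """Shannon entropy of a label vector."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def c45_style_stump(x, y):
    """Single binary split on one numeric attribute, chosen to maximize
    information gain (a sketch of the C4.5-style stump, not the real thing)."""
    order = np.argsort(x)
    x, y = x[order], y[order]
    base = entropy(y)
    best_gain, best_threshold = 0.0, None
    for i in range(1, len(x)):
        if x[i] == x[i - 1]:
            continue  # identical values cannot be separated
        left, right = y[:i], y[i:]
        gain = base - (i * entropy(left) + (len(y) - i) * entropy(right)) / len(y)
        if gain > best_gain:
            best_gain, best_threshold = gain, (x[i - 1] + x[i]) / 2.0
    return best_threshold, best_gain
```

The PDF-based stump would instead estimate a KDE per class along the attribute and place a split at every intersection of those PDFs, which is what lets it carve several regions out of a single attribute.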

To start with, I'll give a simple example data set to illustrate the difference between these two methods using Stumps. 

Here are the results for a 4-class problem, one class in each corner, each generated from a Gaussian. (Make sure to click on the images to get a better view!)

Stumps C45

Stumps MyPDF

Clearly, C45 produces a cleaner-looking cut, but both would actually have about the same error rate in practice. So how does it look when we build a full tree?

Tree C45
Tree MyPDF

Again, the C45 tree looks a lot better. But if you look at the MyPDF version, you can see it has adapted until it makes zero training errors. This is a very simple problem that is well suited to splitting along the axis, though. Let's now look at something more complicated.

In the next few examples, we will look at the difference between the two when using Bagging. Bagging (and Boosting at the end) is always done with 40 iterations.
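
If you want to reproduce this style of experiment, something like scikit-learn's BaggingClassifier around a decision tree is a reasonable stand-in (the trees there are CART rather than C4.5, and this is not the code used for the figures, so expect only qualitatively similar pictures):

```python
# Hypothetical reproduction setup, assuming scikit-learn is available.
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

bagged_tree = BaggingClassifier(
    estimator=DecisionTreeClassifier(),  # the keyword is base_estimator in older versions
    n_estimators=40,                     # 40 bagging iterations, matching the figures
)
# bagged_tree.fit(X_train, y_train)
# bagged_tree.predict(X_test)
```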

This first data set is a series of concentric circles. The outer ones are very close to each other, and the inner ones are farther apart.



Tree C45
Tree MyPDF

Here we can see they have similar performance. The two inner circles look better under MyPDF, but it also has some noise jutting out between the two outer ones. Now let's see how they look after 40 rounds of bagging.

Bagged Tree C45

Bagged Tree MyPDF

The difference is dramatic! The bagged version of C45 improved a little bit, with slightly more rounded boundaries in the outer circles. However, MyPDF now has very good and much smoother boundaries for the outer circles. It has less overlap, and the boundary stays about in the middle, where it belongs.

Another interesting example: this data set contains 3 classes growing out in a spiral. The complexity of the problem adds some more bias to the error term, but it's still a simple data set. First, the normal Trees.

Tree C45
Tree MyPDF
The differences are much more pronounced on this data set. C45 is cleaner, but not very spiral-like. MyPDF is more spiral-like, but very blocky. Another 40 rounds of Bagging later:

Bagged Tree C45
Bagged Tree MyPDF

This one is more complicated. Both are still making errors after bagging, but MyPDF has much more of a spiral shape to it. C45 is still very rigid. Because these artificial data sets have no noise in them, all the variance has to come out of the algorithm itself.



Lastly, let's look at the more complex data set from my previous post on KDE and k-NN. Here are the results for Trees on the same data set.

Tree C45
Tree MyPDF
Clearly, both could do better. There are lots of mistakes in each, but the general idea is there. Despite this, the boundaries are actually more similar than they were in previous problems. 

Bagged Tree C45
Bagged Tree MyPDF
In this more complex data set, we can see the differences between the two are shrinking when using only bagging. While the MyPDF version is probably a little bit better, especially since the dense red cluster in the top right is fit better, the results differ only slightly in three areas of the decision boundary. You can also look at the non-bagged versions and see very clearly how the bagged boundaries came to be. Because the distributions overlap and form a complex boundary, the bias term in the error is playing a much larger role than in the previous synthetic data sets. This has caused the performance of C45 and MyPDF trees to about balance out, and the variance-reduction abilities of Bagging just aren't enough to get a really good decision boundary like before.


So what about AdaBoost?

AdaBoost Tree C45
AdaBoost Tree MyPDF

AdaBoost, unlike bagging, does more than just reduce the variance. It also reduces bias by reweighting the data points based on previous errors. In this case, we can see that the classifier built with C45 is much cleaner than MyPDF. While the denser red band to the top right isn't perfect, MyPDF has clearly started to overfit and learn the noise in the model. MyPDF trees just weren't as good a weak learner for this task, because their bias was a bit too low. This allowed them to start overfitting after just a few iterations.
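
To make the reweighting step concrete, here is a bare-bones sketch of binary AdaBoost (labels in {-1, +1}) around any weak learner that accepts sample weights. It is only meant to show where the extra bias reduction comes from, and is not the exact boosting variant or implementation used for these figures.

```python
import numpy as np

def adaboost(X, y, make_weak_learner, rounds=40):
    """Sketch of binary AdaBoost: y in {-1, +1}, weak learner must
    support fit(X, y, sample_weight=...) and predict(X)."""
    n = len(y)
    w = np.full(n, 1.0 / n)                  # start with uniform weights
    learners, alphas = [], []
    for _ in range(rounds):
        h = make_weak_learner().fit(X, y, sample_weight=w)
        pred = h.predict(X)
        err = w[pred != y].sum()             # weighted training error
        if err >= 0.5:                       # no better than chance: stop early
            break
        alpha = 0.5 * np.log((1 - err) / max(err, 1e-12))
        w *= np.exp(-alpha * y * pred)       # up-weight mistakes, down-weight correct points
        w /= w.sum()
        learners.append(h)
        alphas.append(alpha)

    def predict(Xq):
        votes = sum(a * h.predict(Xq) for a, h in zip(alphas, learners))
        return np.sign(votes)
    return predict
```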

Just for fun, what if we use weak learners that are not strong enough? Below you can look at AdaBoost using Stumps.

AdaBoost Stump C45
AdaBoost Stump MyPDF
While the MyPDF results with AdaBoost are not great, C45 didn't seem to change at all! The boundary would look a little different for C45 if I stopped at just 4 or 8 iterations, but C45 stumps simply are not powerful enough to really learn better weights.

Because of the overlap, some of the blue points on the left get higher weights for being wrong, and some of the red points on the right get higher weights. After a few rounds of this, when the weak learners get averaged out, the final decision boundary ends up almost exactly where a single stump would have placed it. Splitting on the x axis never gets selected, because it would grab the dense red strip: almost all the points below would be wrong for red, and a number of blue points near the dense region would be wrong as well. So C45 stumps get stuck moving left and right.

Hopefully, this last example makes it clear that bias and variance really are a trade-off when we try to perform classification. A great thing about decision trees is that you can restrict the maximum depth of the tree. By increasing the max depth, you decrease the bias and increase the variance. This gives us a very simple parameter to control the bias/variance trade-off when using trees as a weak learner for Ensemble methods.
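
In scikit-learn terms (again just an illustrative configuration, not the setup used above), that knob is a single constructor argument:

```python
from sklearn.tree import DecisionTreeClassifier

stump     = DecisionTreeClassifier(max_depth=1)     # high bias, low variance
shallow   = DecisionTreeClassifier(max_depth=4)     # somewhere in between
full_tree = DecisionTreeClassifier(max_depth=None)  # low bias, high variance
```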

Also, in the real world, most problems have a combination of bias and variance in their error, unlike my synthetic data sets which are mostly variance. In practice C45 trees make excellent weak learners.

The one part of this equation we did not address was noise, which I hope to get to in a later post. 

Monday, November 19, 2012

Bias, Variance & Ensembles

A very popular method in Machine Learning is to combine multiple classifiers into one stronger classifier, which is called Ensemble learning. Simple enough, but how do we know it will do well? Do the classifiers being combined all have to be completely different, or can we use the same one over and over? In this post I'll introduce the basic concept and cover a bit of it that I think a lot of people don't fully understand.

This first means a detour: what causes errors, and how do we measure them? The standard metric people tend to use is the squared error. So if \(f(x)\) is the true output of the function, \(\epsilon\) is the noise in the function's output, and \(h(x)\) is our approximation of the function, the expected error is then

$$Error = E[(f(x)+\epsilon - h(x))^2]$$

This can be decomposed into 3 parts.

$$Error = \underbrace{(f(x)-E[h(x)])^2}_{Bias} + \underbrace{E[(h(x)-E[h(x)])^2]}_{Variance} + \underbrace{E[\epsilon^2]}_{Noise}$$

So what does this all tell us? The noise is just noise; there is not much we can do about it. If the noise dominates our error, there is nothing we can do from a learning perspective.

The bias is the squared difference between the true function's output and the expected output of many approximations. Essentially: what is the squared difference between \(f(x)\) and the best approximation we could possibly make?

The variance is the expected squared difference between any one approximation and this best (expected) approximation.

This also tells us we have a trade-off to be made when creating an algorithm. Unless we switch to an entirely different algorithm (or use a meta-algorithm like Ensemble methods), the only way to reduce the error is to adjust the bias and variance. However, it's not possible to only reduce the bias. Reducing the bias tends to increase the variance, and reducing the variance tends to increase the bias.
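
To make the trade-off tangible, here is a small Monte Carlo sketch (a regression example, purely illustrative) that estimates the bias and variance terms by retraining the same model on many fresh noisy samples and looking at its predictions at one query point:

```python
import numpy as np

rng = np.random.default_rng(0)
f = np.sin                 # the true function f(x)
noise_sd = 0.3             # epsilon ~ N(0, noise_sd^2)
x_query = 1.0              # the point where we decompose the error

def train_and_predict(degree):
    """Fit one polynomial to a fresh noisy sample, predict at x_query."""
    x = rng.uniform(0, 2 * np.pi, 25)
    y = f(x) + rng.normal(0, noise_sd, x.size)
    return np.polyval(np.polyfit(x, y, degree), x_query)

for degree in (1, 3, 9):   # increasing flexibility: less bias, more variance
    preds = np.array([train_and_predict(degree) for _ in range(2000)])
    bias_sq = (f(x_query) - preds.mean()) ** 2  # (f(x) - E[h(x)])^2
    variance = preds.var()                      # E[(h(x) - E[h(x)])^2]
    print(f"degree {degree}: bias^2 = {bias_sq:.4f}, variance = {variance:.4f}")
```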

To help emphasize this, consider two extreme learners \(h\).

  1. The learner always outputs the same class, no matter what the input is.
  2. The learner creates a random mapping from every possible input to one randomly chosen, valid output. 
The bias for \(h_1\) in this case would be very large. If \(h_1\) returned the majority class no matter what, the bias is 1 minus the prior probability of that class. The variance is zero though! It doesn't matter what we give the learner to learn on, it's going to output the same thing! Clearly, unless the problem was very unbalanced, \(h_1\) would produce bad results, and the error is entirely in the bias term.

The bias for \(h_2\) would be zero though! One of the random mappings would be a perfect mapping, so there is no bias - it can learn the true output function. However, the variance is huge! The number of incorrect mappings will be enormous - and most of them will be completely wrong. Again we would expect a very large error, but this time the error is all contained within the variance term. 

So what does this have to do with Ensemble methods? Some of the most popular are Boosting methods, which take weak learners (learners that are not good on their own) and produce a final output that is very good, reducing both bias and variance. So which learners work best? Ones that are low bias, high variance. Why? I will demonstrate with an example.

The precursor to boosting is a technique called Bagging. Bagging is a simple procedure, and works as follows. 

Learning:
  for \(n\) iterations:
    Sample with replacement from the data set so that your sample is the same size as the data set.
    Train a learner on the sample.

Output:
  Output the majority vote of the \(n\) learned classifiers.

That is it, and it's very simple. It increases accuracy by reducing variance. How? By approximating \(E[h(x)] \approx \hat{h}(x)\) empirically, combining multiple \(h(x)\). Each bagged sample is different from the original data set and from each other, so we get various learners that perform differently if \(h(x)\) has a larger variance. The bias term would not be changed by this, but the variance term would go from \(E[(h(x)-E[h(x)])^2]\) to \(E[(\hat{h}(x)-E[h(x)])^2]\). If our approximation is good, \(\hat{h}(x)\) will sit close to \(E[h(x)]\), so this should result in a smaller difference.
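
A literal translation of that procedure into code looks something like the following. This is a minimal sketch assuming a scikit-learn-style base learner with fit/predict and small integer class labels, not a production implementation:

```python
import numpy as np

def bag(X, y, make_learner, n_rounds=40, seed=0):
    """Train n_rounds copies of the base learner, each on a bootstrap
    sample the same size as the original data set."""
    rng = np.random.default_rng(seed)
    models = []
    for _ in range(n_rounds):
        idx = rng.integers(0, len(y), size=len(y))   # sample with replacement
        models.append(make_learner().fit(X[idx], y[idx]))
    return models

def bag_predict(models, Xq):
    """Majority vote of the bagged learners."""
    votes = np.stack([m.predict(Xq) for m in models])         # (n_rounds, n_queries)
    return np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, votes)
```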

Ok, so I've argued that we can reduce the variance - but why should we use low bias / high variance weak learners? 

Consider Bagging with the Nearest Neighbor classifier (1-NN). 1-NN is a low bias, low variance classifier. Say we perform 100 bagging iterations. Each sample, since we sampled with replacement, can have duplicate data points in it. In the limit, we expect each sample to contain only about 63% of the original data points, with the rest being duplicates. That means when we do 100 iterations of bagging and we have some query \(q\), we would expect around 63 of the bagged learners to contain the true nearest neighbor to \(q\). Since we output the majority vote, and the majority is likely to have the same nearest neighbor, we will output the same result as regular Nearest Neighbor.
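
That 63% figure is just the probability that a given point shows up at least once in a bootstrap sample, \(1 - (1 - 1/n)^n\), which approaches \(1 - 1/e \approx 0.632\) as \(n\) grows. A two-line sanity check:

```python
n = 100_000                    # size of the data set (and of each bootstrap sample)
print(1 - (1 - 1 / n) ** n)    # ~0.632: about 63% of the points are unique per sample
```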

Clearly, now we can see that if the variance isn't large enough, we get no improvement at all! So we want weak learners that have variance. Why low bias ones instead of high bias ones? Intuitively, it is better to start closer to the solution (low bias) than farther away. Practically, if the bias is too high, many Ensemble methods will fail to converge, resulting in an Ensemble of size 1, which is just a single weak learner! There is also the issue that the more weak learners we need, the longer it will take to classify. So if we can use fewer weak learners and get the same result, that is generally better.

Lastly, we also now have a simple recipe for trying to build a better algorithm. Take an algorithm A. Now adjust it so we have A', which reduces its bias and increases its variance. Use bagging to reduce the variance back down. Now, hopefully, both the bias and variance are lower than when we started.