A Decision Stump (or Stumps) are just a Decision Tree with only one level. That means a stump can only select one attribute to split on. I will be comparing Apples to Apples here, so the main difference is how splitting will be done. The only setting that will change when I compare 2 Stumps or Trees will be how splitting is done. The gain method, depth, pruning method, and everything else will be the same.
- C4.5 style splitting. Numeric attributes will be split by creating only a single binary divide that maximizes the gain.
- My PDF splitting method. Numeric attributes can split an arbitrary number of times. Splitting is done by approximating the PDF of the attribute for each class using a KDE, and splitting at the intersections of the PDFs. All splits for the same class map to one path, so multiples value ranges map to a single path. Thus, the splits map every data point to class that had the maximum probability in a given range when considering only one attribute. (Mine is a bit more complicated!)
The first method (C45) has a higher bias, since it can only do one split along the axis, and a lower variance.
The second method (MyPDF) has a lower bias, since it can adapt to the fluctuations in the PDF of the attribute in question. This increases the variance, since if the PDFs would be truly equal, random fluctuations may cause intersections that make us split for no good reason.
To start with, I'll give a simple example data set to illustrate the difference between these two methods using Stumps.
Here are the results for a 4 class problem, 1 class in each corner - generated from Gaussians. (Make sure to click on the images to get a better view!)
Clearly, the C45 produces a cleaner looking cut, but they would both actually have the same error rate in practice. So how does it look when we build a full tree?
|
|
Again, the C45 looks a lot better. But if you look at the MyPDF version, you can see it has adapted to have zero training errors. But this is a very simple problem that is well suited to splitting along the axis. Lets now look at something more complicated
The next few examples we will look at the difference between the two when using Bagging. Bagging (and boosting at the end) are always done with 40 iterations.
This first data set is a series of encompassing circles. The outer ones are very close too each other, and the inner ones are farther apart.
|
|
Here we can see they have similar performance. The two inner circles look better under MyPDF, but it also has some noise jutting out between the two outer ones. Now lets see how they look after 40 rounds of bagging.
|
|
The difference is dramatic! The Bagged version of C45 improved a little bit, with slightly more rounded boundaries in the outer circles. However MyPDF now has very good and much smoother boundaries for the outer circles. It has less overlap, and the boundary stays about in the middle where it belongs.
Another interesting example, this data set contains 3 classes growing out in a spiral. The complexity of the problem adds some more bias to the error term, but its still a simple data set. First, the normal Trees.
|
|
|
|
This one is more complicated. Both are still getting errors after bagging, but MyPDF has much more of a spiral shape to it. C45 is still very rigged. Because these artificial data sets have no noise in them, all the variance has to come out of the algorithm itself.
Lastly, lets look at the more complex data set from my previous post on KDE and k-NN. Here is the results for Trees on the same data set.
|
|
In this more complex data set, we can see the differences between the two are shrinking when using only bagging. While the MyPDF version is probably a little bit better, especially since the dense red cluster in the top right is better fit, the results are only slight differences in 3 areas of the decision boundary. You can also look at the non bagged version and see very clearly how the bagged boundary came to be. Because the distributions overlap and form a complex boundary, the bias term in the error is playing a much larger role than in the previous synthetic data sets. This has caused the performance of C45 and MyPDF trees to about balance out, and the variance reduction abilities of Bagging just aren't enough to get a really good decision boundary like before.
So what about AdaBoost?
AdaBoost, unlike bagging, does more then just reduce the variance. It also reduces bias by reweighing the data points based on previous errors. In this case, we can see that the classifier built with C45 is much cleaner than MyPDF. While the denser red band to the top right isn't perfect, MyPDF has clearly started to over fit and learn the noise in the model. MyPDF trees just weren't as good a weak learner for this task, because they were a bit too low bias. This allowed them to start overfitting after just a few iterations.
Just for fun, what if we use weak learners that are not strong enough? Bellow you can look at AdaBoost using Stumps.
Just for fun, what if we use weak learners that are not strong enough? Bellow you can look at AdaBoost using Stumps.
While the MyPDF results with AdaBoost are not great, C45 didn't seem to change at all! The boundary would look a little different for C45 if I stopped at just 4 or 8 iterations, but C45 stumps simply are not powerful enough to really learn better weights.
Because of the overlap, some of the blue points on the left get higher weights for being wrong, and the red on the right get higher weights. After a few rounds of this, when the weak learners get averaged out, the final decision boundary ends up almost exactly where a single stump would have placed it. Using the x axis as the feature never gets selected because it would grab the dense red strip, almost all the points bellow would be wrong for red, and a number of blue points near the dense region would be wrong. So C45 stumps get stuck moving left and right.
Hopefully, this last example makes it clear that bias and variance really is a trade off when we try to perform classification. A great thing about decision trees is that you can restrict the maximum depth of the tree. By increasing the max depth, you decrease the bias and increase the variance This gives us a very simple parameter to control the bias/variance trade off when using trees as a weak learner for Ensemble methods.
Also, in the real world, most problems have a combination of bias and variance in their error, unlike my synthetic data sets which are mostly variance. In practice C45 trees make excellent weak learners.
The one part of this equation we did not address was noise, which I hope to get to in a later post.
Because of the overlap, some of the blue points on the left get higher weights for being wrong, and the red on the right get higher weights. After a few rounds of this, when the weak learners get averaged out, the final decision boundary ends up almost exactly where a single stump would have placed it. Using the x axis as the feature never gets selected because it would grab the dense red strip, almost all the points bellow would be wrong for red, and a number of blue points near the dense region would be wrong. So C45 stumps get stuck moving left and right.
Hopefully, this last example makes it clear that bias and variance really is a trade off when we try to perform classification. A great thing about decision trees is that you can restrict the maximum depth of the tree. By increasing the max depth, you decrease the bias and increase the variance This gives us a very simple parameter to control the bias/variance trade off when using trees as a weak learner for Ensemble methods.
Also, in the real world, most problems have a combination of bias and variance in their error, unlike my synthetic data sets which are mostly variance. In practice C45 trees make excellent weak learners.
The one part of this equation we did not address was noise, which I hope to get to in a later post.