The phenomenon of simply combining the results of multiple models to improve performance is well known, and many competition submissions are just the averaged (or majority-vote) predictions of several models. The difference with stacking is that it learns a weighted vote based on the performance of each classifier, which should perform better than any individual model in the ensemble (and better than naive averaging).
However, there is no reason why the combining phase has to be a linear model, so in my implementation the base aggregating model can be specified, though I have been using linear models for my toying around.
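To make the difference between naive averaging and a learned weighted vote concrete, here is a minimal sketch in plain Python. Everything in it is an assumption for illustration: the probabilities are invented toy numbers, the names (`base_probs`, `stacked_prob`) are hypothetical, and the linear aggregator is a small logistic regression fit by gradient ascent, not the code from my implementation. The probabilities are centred at 0.5, so the sketch leaves out a bias term.

```python
import math

# Invented toy numbers: P(class 1) from three hypothetical base models on six
# held-out points. One model is reasonable, the other two are badly tuned.
labels = [1, 1, 1, 0, 0, 0]
base_probs = [
    [0.9, 0.8, 0.7, 0.2, 0.3, 0.1],  # reasonable model
    [0.1, 0.2, 0.6, 0.9, 0.8, 0.6],  # badly tuned model
    [0.2, 0.3, 0.4, 0.9, 0.7, 0.7],  # badly tuned model
]
n_models, n_points = len(base_probs), len(labels)


def accuracy(preds):
    return sum(int(p == y) for p, y in zip(preds, labels)) / n_points


# Naive ensemble: average the probabilities and threshold at 0.5.
avg_preds = [
    int(sum(base_probs[m][i] for m in range(n_models)) / n_models > 0.5)
    for i in range(n_points)
]

# Stacking with a linear aggregator: logistic regression over the base
# models' centred probabilities, fit by plain gradient ascent (no bias term).
meta = [[base_probs[m][i] - 0.5 for m in range(n_models)] for i in range(n_points)]
weights, lr = [0.0] * n_models, 0.5


def stacked_prob(row):
    z = sum(w * f for w, f in zip(weights, row))
    return 1.0 / (1.0 + math.exp(-z))


for _ in range(2000):
    errors = [y - stacked_prob(row) for row, y in zip(meta, labels)]
    for m in range(n_models):
        weights[m] += lr * sum(e * row[m] for e, row in zip(errors, meta)) / n_points

stacked_preds = [int(stacked_prob(row) > 0.5) for row in meta]
avg_acc, stacked_acc = accuracy(avg_preds), accuracy(stacked_preds)
print("naive average accuracy:", avg_acc)
print("stacked accuracy:", stacked_acc)
print("learned weights:", [round(w, 2) for w in weights])
```

The aggregator here is scored on the same six points it was fit on, which is only acceptable for a toy. In practice the aggregator should be fit on predictions the base models made for data they were not trained on, and then evaluated on a separate set.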
The other thing I've not seen mentioned is that Stacking can be done in an online fashion. By updating the aggregating model before the models being ensembled, you get unbiased updates to the aggregator, because each base model's prediction is made before that model has trained on the example. I've been combining this with Passive Aggressive models as the aggregators and have been getting some great results. I use the PA models because they adapt and learn very quickly, and don't need any parameter tuning to work well.
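The update order described above can be sketched roughly as follows. This is a hedged illustration under stated assumptions, not my actual implementation: `OnlineLogistic`, `PassiveAggressive`, and `OnlineStacker` are hypothetical names, the base learners are simple SGD logistic models rather than kernel machines, the aggregator is the basic binary Passive Aggressive update (step size = hinge loss / squared feature norm), and the data stream is synthetic.

```python
import math
import random


class OnlineLogistic:
    """Stand-in base learner: logistic regression on one hand-picked feature
    map, trained by SGD. It outputs P(y = 1 | x)."""

    def __init__(self, feature_map, n_features, lr=0.5):
        self.feature_map = feature_map
        self.w = [0.0] * n_features
        self.lr = lr

    def predict_proba(self, x):
        z = sum(w * f for w, f in zip(self.w, self.feature_map(x)))
        return 1.0 / (1.0 + math.exp(-max(-30.0, min(30.0, z))))

    def update(self, x, y):  # y in {0, 1}
        err = y - self.predict_proba(x)
        self.w = [w + self.lr * err * f for w, f in zip(self.w, self.feature_map(x))]


class PassiveAggressive:
    """Basic binary Passive Aggressive learner over n inputs plus a bias."""

    def __init__(self, n):
        self.w = [0.0] * (n + 1)

    def score(self, f):
        return sum(w * v for w, v in zip(self.w, f + [1.0]))

    def update(self, f, y):  # y in {-1, +1}
        f1 = f + [1.0]
        loss = max(0.0, 1.0 - y * self.score(f))
        if loss > 0.0:
            tau = loss / sum(v * v for v in f1)
            self.w = [w + tau * y * v for w, v in zip(self.w, f1)]


class OnlineStacker:
    def __init__(self, base_models):
        self.base = base_models
        self.agg = PassiveAggressive(len(base_models))

    def _meta(self, x):
        # Map each probability from [0, 1] to [-1, 1] so "unsure" is 0.
        return [2.0 * m.predict_proba(x) - 1.0 for m in self.base]

    def predict(self, x):
        return 1 if self.agg.score(self._meta(x)) > 0 else 0

    def update(self, x, y):
        meta = self._meta(x)  # base predictions made BEFORE they see (x, y)
        self.agg.update(meta, 1 if y == 1 else -1)  # aggregator first
        for m in self.base:  # base models second
            m.update(x, y)


random.seed(0)
stacker = OnlineStacker([
    OnlineLogistic(lambda x: [x[0], 1.0], 2),
    OnlineLogistic(lambda x: [x[1], 1.0], 2),
    OnlineLogistic(lambda x: [x[0] * x[1], 1.0], 2),
])
correct, n_rounds = 0, 2000
for _ in range(n_rounds):
    x = (random.uniform(-1, 1), random.uniform(-1, 1))
    y = 1 if x[0] + x[1] > 0 else 0
    correct += int(stacker.predict(x) == y)  # test first, then train
    stacker.update(x, y)
print("prequential accuracy:", correct / n_rounds)
```

The important line is the order inside `OnlineStacker.update`: the meta-features are computed and fed to the aggregator before any base model trains on the example, so the aggregator never learns from predictions the base models made on data they had already seen.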
To see visually that it was working, I created a simple ensemble of 3 models on a 2D dataset. My base models are online logistic kernel machines, with 3 different but poorly chosen RBF kernels. One is far too large, one is just a bit too large, and the final one is so small that it overfits.
Then I applied my online Stacking implementation. As you can see, it clearly learns a model that is better than any individual model from the ensemble. This can be very useful in scenarios that don't require real-time predictions.
[Figure: Online Stacking of 3 Logistic Kernel Machines]
In doing this I explicitly used a Logistic model because it produces output probabilities. This makes a huge difference in the performance of Stacking. Using the hinge loss just doesn't work as well, as shown below.
[Figure: Online Stacking of 3 Online SVMs]
This is because with the hinge loss you only get a "yes/no" prediction, and we have to learn the weight for each model based on that alone. If a model knows that it doesn't know what to predict and returns probabilities indicating as much, we can learn a better weight that takes the confidence of its predictions into account. When a model 'doesn't know', its vote gets washed out at prediction time because it contributes equally to all classes, and we can learn to exploit this. Here we show the probabilities of the 3 models, and it becomes much easier to see how they could be combined in a way that improves the overall accuracy, giving the above result.
The wide model gives us the votes for the top and left/right sides of the space, the middle model gives us most of the edges around the data, and the left model gives us the votes near the borders of the classes (lightened by the smoother border of the middle model).
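The point about an unsure model's vote washing out can be seen with a few made-up numbers. In this small sketch the three probabilities are invented for a single hypothetical point, and the combination is a plain unweighted average rather than a learned aggregator, so it only illustrates why probabilities carry more usable information than hard votes.

```python
# Made-up P(class 1) for a single point from three hypothetical models:
# two are barely leaning towards class 1, one is confident in class 0.
probs = [0.52, 0.55, 0.05]

# Hard "yes/no" votes, which is all a hinge-loss model exposes.
hard_votes = [int(p > 0.5) for p in probs]
hard_pred = int(sum(hard_votes) > len(probs) / 2)  # majority vote

# Soft votes: the unsure models contribute almost equally to both classes,
# so the confident model decides the outcome.
soft_score = sum(probs) / len(probs)
soft_pred = int(soft_score > 0.5)

print("hard votes:", hard_votes, "->", hard_pred)
print("soft score:", round(soft_score, 3), "->", soft_pred)
```

With hard votes the two unsure models outvote the confident one. With probabilities their near-0.5 outputs add almost nothing to either class, which is the behaviour a stacked aggregator can learn to rely on.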
One could also imagine learning rules to exploit various correlations between the models' predictions. This is why I've made the aggregating model configurable rather than forcing everyone to use linear aggregators.
So how does stacking stack up (har har) against the others? Will it usually beat them? Is it a good first choice if you don't know what you are doing?
Stacking won't necessarily beat other ensemble methods like RF or AdaBoost given the same base models. But you can use Stacking to combine RF and AdaBoost into something more. Basically, I don't quite see it as a competing technique.
I wouldn't call stacking a good first choice if you don't know what you are doing. Just taking the average/majority vote naively is the better 'I don't know what I'm doing' option.
But I think Stacking's advantage is in being able to *better* combine completely different models, and in being able to combine models with different parameter values. For example, if you do a grid search and naively combine the model votes, most of the models may do badly because their parameters are bad, giving performance worse than just picking the best model. Stacking, by contrast, would learn not to use those bad models and instead combine the best of them. You could potentially do a coarser (i.e. faster) search of parameters and use Stacking to pick up the slack from not selecting as good a single parameter setting.