The phenomena of simply combining the results of multiple models to improve performance is well known, and many competition submissions are just the averaged (or majority vote) predictions from several created models. The difference with stacking is that it learns a weighted vote based on the performance of each classifier, which should perform better than any individual model in the ensemble (and better than a naive averaging).

However, there is no reason why the combing phase has to be a linear model. So in my implementation the base aggregating model can be specified. Though I have been using linear models for my toying around.

The other thing I've not seen mentioned is that Stacking can be done in an online fashion. By updating the aggregating model before the models being ensembles, you can get unbiassed updates to the aggregator. I've been combing this with Passive Aggressive models as the aggregators and been getting some great results. I use the PA models because they adapt and learn very quickly, and don't

*need*any parameter tuning to work well.

To visually see that it was working, I created a simple ensemble of 3 models on a 2D dataset. My base models are online logistic kernel machines, with 3 different but poorly chosen RBF kernels. One is far too large, one is just a bit too large, and the final one was too small that it overfit.

Then I applied my online Stacking implementation. As you can see, it clearly learns a model that is better than any individual model from the ensemble. This can be very useful in scenarios that don't require real time predictions.

Online Stacking of 3 Logistic Kernel Machines |

In doing this I explicitly used a Logistic model because it produces output probabilities. This makes a huge difference in the performance of Stacking. Using the hinge loss just doesn't work as well, as show below.

Online Stacking of 3 Online SVMs |

This is because with the hinge, you only get a "yes/no" predication and we need to learn the weight for each model based on that alone. If a model knew that it

*dosn't*know what to predict and returned probabilities indicating such, we could learn a better weight that takes into account the confidence in predictions. In the case of the model 'not knowing' its vote will get washed out when making a prediction by contributing equally to all classes - and we can learn to exploit this. Here we show the probabilities of the 3 models, and it becomes much easier to see how they could be combined in a way that improves the overall accuracy, giving the above result.

The wide model gives us the votes for the top and left/right sides of the space, the middle model gives us most of the edges around the data, and the left model gives us the votes near the borders of the classes (lightened by the smoother border of the middle model).

One could also imagine learning rules to exploit various correlations between model's predictions. This is why I've made the aggregating model configurable rather than force everyone to use linear aggregators.