This first means taking a detour: what causes errors, and how do we measure them? The standard metric people tend to use is the squared error. So if \(f(x)\) is the true output of the function, \(\epsilon\) is the noise in the function's output, and \(h(x)\) is our approximation of the function, the expected error is then
$$Error = E[(f(x)+\epsilon - h(x))^2]$$
This can be decomposed into 3 parts.
$$Error = \underbrace{(f(x)-E[h(x)])^2}_{Bias} + \underbrace{E[(h(x)-E[h(x)])^2]}_{Variance} + \underbrace{E[\epsilon^2]}_{Noise}$$
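To see where this decomposition comes from, here is a quick sketch of the algebra, assuming the noise \(\epsilon\) has zero mean and is independent of \(h(x)\), with the expectations taken over training sets and noise:
$$\begin{aligned} E[(f(x)+\epsilon-h(x))^2] &= E[(f(x)-h(x))^2] + 2E[\epsilon]\,E[f(x)-h(x)] + E[\epsilon^2] \\ &= E[\big((f(x)-E[h(x)]) + (E[h(x)]-h(x))\big)^2] + E[\epsilon^2] \\ &= (f(x)-E[h(x)])^2 + E[(h(x)-E[h(x)])^2] + E[\epsilon^2] \end{aligned}$$
The cross terms vanish because \(E[\epsilon] = 0\) and \(E[E[h(x)] - h(x)] = 0\).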
So what does this all tell us? The noise is just noise; there is not much we can do about it. If the noise dominates our error, there is nothing we can do from a learning perspective.
The bias is the squared difference between the true function's output and the expected output over many approximations. Essentially, what is the squared difference between \(f(x)\) and the best approximation we could possibly make?
The variance is the expected squared difference between an individual approximation and this best approximation.
This also tells us there is a trade off to be made when creating an algorithm. Unless we switch to an entirely different algorithm (or use a meta algorithm like Ensemble methods), the only way to reduce the error is to adjust the bias and variance. However, it's not possible to adjust just one of them. Reducing the bias tends to increase the variance, and reducing the variance tends to increase the bias.
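To make the trade off concrete, here is a minimal simulation in Python, assuming a toy sine-wave target with Gaussian noise and polynomial fits (all of those choices are arbitrary, just for illustration). It estimates the bias and variance terms empirically by training many models on independently drawn training sets:

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # The "true" function for this toy example
    return np.sin(2 * np.pi * x)

def fit_poly(degree, n_points=30, noise_sd=0.3):
    # Train one polynomial model on a fresh noisy training set
    x = rng.uniform(0, 1, n_points)
    y = f(x) + rng.normal(0, noise_sd, n_points)
    return np.polyfit(x, y, degree)

x_test = np.linspace(0, 1, 200)
for degree in [1, 3, 9]:
    # Predictions h(x) from 200 independently trained models
    preds = np.array([np.polyval(fit_poly(degree), x_test) for _ in range(200)])
    mean_pred = preds.mean(axis=0)                  # estimate of E[h(x)]
    bias = np.mean((f(x_test) - mean_pred) ** 2)    # (f(x) - E[h(x)])^2, averaged over x
    variance = np.mean(preds.var(axis=0))           # E[(h(x) - E[h(x)])^2], averaged over x
    print(f"degree {degree}: bias term {bias:.3f}, variance term {variance:.3f}")
```

With these settings the low degree fits should show a large bias term and a small variance term, and the high degree fits the opposite.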
To help emphasize this, consider two extreme learners, \(h_1\) and \(h_2\):
- \(h_1\) always outputs the same class, no matter what the input is.
- \(h_2\) creates a random mapping from every possible input to one randomly chosen, valid output.
The bias for \(h_1\) in this case would be very large. If \(h_1\) returned the majority class no matter what, the bias is 1 minus the prior probability of that class. The variance is zero though! It doesn't matter what data we give the learner to learn on, it's going to output the same thing! Clearly, unless the problem was very unbalanced, \(h_1\) would produce bad results, and the error is entirely in the bias term.
The bias for \(h_2\) would be zero though! One of the random mappings would be a perfect mapping, so there is no bias - it can learn the true output function. However, the variance is huge! The number of incorrect mappings will be enormous - and most of them will be completely wrong. Again we would expect a very large error, but this time the error is all contained within the variance term.
So what does this have to do with Ensemble methods? Some of the most popular are Boosting methods, which take weak learners (learners that are not good on their own) and produce a final output that is very good, reducing both bias and variance. So which weak learners work best? Ones that are low bias, high variance. Why? I will demonstrate with an example.
The precursor to boosting is a technique called Bagging. Bagging is a simple procedure, and works as follows.
Learning:
- For \(n\) iterations:
    - Sample with replacement from the data set, so that your sample is the same size as the data set.
    - Train a learner on the sample.

Output:
- Output the majority vote of the \(n\) learned classifiers.
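As a concrete sketch of the procedure (my own illustration, assuming numpy arrays, integer class labels, and a scikit-learn decision tree as the weak learner; any classifier with fit/predict would do), Bagging might look like this:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, n_learners=100, seed=0):
    """Train n_learners weak learners, each on a bootstrap sample of (X, y)."""
    rng = np.random.default_rng(seed)
    m = len(X)
    learners = []
    for _ in range(n_learners):
        # Sample with replacement, same size as the original data set
        idx = rng.integers(0, m, size=m)
        learners.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return learners

def bagging_predict(learners, X):
    """Output the majority vote of the learned classifiers (assumes integer class labels)."""
    votes = np.array([learner.predict(X) for learner in learners])  # (n_learners, n_queries)
    return np.array([np.bincount(col).argmax() for col in votes.T.astype(int)])
```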
That is it, and it is very simple. It increases accuracy by reducing variance. How? By approximating \( E[h(x)] \approx \hat{h}(x) \) empirically, combining multiple \(h(x)\). Each bagged sample is different from the original data set and from each other, so if \(h(x)\) has a larger variance we get learners that perform differently. The bias term is not changed by this, but the variance term goes from \(E[(h(x)-E[h(x)])^2]\) to \(E[(\hat{h}(x)-E[h(x)])^2]\), and if our approximation is good, this should be a smaller difference.
Ok, so I've argued that we can reduce the variance - but why should we use low bias / high variance weak learners?
Consider Bagging with the Nearest Neighbor classifier (1-NN). 1-NN is a low bias, low variance classifier. Say we perform 100 bagging iterations. Since we sample with replacement, each sample can contain duplicate data points. In the limit, we expect each sample to contain only about 63% of the distinct points in the original data set, with the rest being duplicates. That means when we do 100 iterations of bagging and we have some query \(q\), we would expect around 63 of the bagged learners to contain the true nearest neighbor to \(q\). Since we output the majority vote, and the majority is likely to share the same nearest neighbor, we will output the same result as regular Nearest Neighbor.
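The 63% figure comes from a short calculation. With a data set of size \(m\), the probability that a particular point is never drawn in \(m\) draws with replacement is
$$\left(1 - \frac{1}{m}\right)^m \approx e^{-1} \approx 0.368$$
for large \(m\), so each bagged sample contains roughly \(1 - e^{-1} \approx 63.2\%\) of the distinct original points.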
Clearly, now we can see that if the variance isn't large enough, we get no improvement at all! So we want weak learners that have variance. Why low bias ones instead of high bias ones? Intuitively, it is better to start closer to the solution (low bias) than farther away. Practically, if the bias is too high, many Ensemble methods will fail to converge, resulting in an Ensemble of size 1, which is just a single weak learner! There is also the issue that the more weak learners we need, the longer it will take to classify. So if we can use fewer weak learners and get the same result, that is generally better.
Lastly, we also now have a simple method for trying to build a better algorithm. Take an algorithm A. Now adjust it so we have A', which reduces its bias and increases its variance. Use Bagging to reduce the variance back down. Now, hopefully, both the bias and variance are lower than when we started.
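As a rough sketch of that recipe (my own toy example, assuming scikit-learn 1.2+ and a synthetic data set, neither of which is essential), let A be a shallow decision tree, A' a fully grown tree, and then bag A':

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# A: a shallow tree (higher bias, lower variance)
a = DecisionTreeClassifier(max_depth=2, random_state=0)
# A': a fully grown tree (lower bias, higher variance)
a_prime = DecisionTreeClassifier(random_state=0)
# Bagged A': use Bagging to pull the variance back down
bagged = BaggingClassifier(estimator=a_prime, n_estimators=100, random_state=0)

for name, model in [("A", a), ("A'", a_prime), ("bagged A'", bagged)]:
    print(name, cross_val_score(model, X, y, cv=5).mean())
```

The exact numbers will depend on the data, but the pattern is the point: bagging the low bias, high variance learner is what buys the improvement.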