Thursday, June 13, 2013

Probability Calibration

One of the things that I've been meaning to add to JSAT for a while now is probability calibrations. Speciafically - when I was first learning about Platt's SMO algorithm a few years ago - I came across his paper for giving SVMs probabilistic outputs.

Platt, J. C. (1999). Probabilistic Outputs for Support Vector Machines and Comparisons to Regularized Likelihood MethodsAdvances in Large Margin Classifiers (pp. 61–74). MIT Press. Retrieved from here

As you can guess by the title, this allows you to calibrate the outputs of the SVM to produce probability estimates. Instead of taking the sign of the SVM's output as the class label, you essentially create a new data set that has the same labels, but with one dimension (the output of the SVM). You then train on this new data set, and feed the output of the SVM as the input to this calibration method, which returns a probability. In Platt's case, we are essentially just performing logistic regression on the output of the SVM with respect to the true class labels. 

You'll notice that this description only makes sense for binary classification problems, so I've created a BinaryScoreClassifier interface for algorithms that fit the bill. Then I implemented Platt's calibration scheme and another based on Isotonic Regression, which makes a weaker assumption than Platt's algorithm. 

The main thing that needs to be added is support for cross validation in the calibration. The naive thing to do would be to train the model, and get the scores afterwords for each data point. This could lead to overfitting, and as Platt points out in his papper - all the support vectors will get scores of -1 or 1, which would be rare in practice. To combat this, you can do just 3 folds of cross validation - where you train on 2 folds and get the scores for the 3rd fold.

So here is a quick look at the difference between this naive method and using cross validation (CV) to fit the calibrations.
Fitting SVM SVM + Platt SVM + Isotonic

CV * same as above

As you can see, the CV fittings are general better. For Platt, the top left area is getting a bit too faded for the naive method, and the Isotonic hasn't really learned any probability range.

However, overfitting is not as big an issue with these calibration schemes as one might think. The decision surface itself is learned by the underlying algorithm, and our calibration has no impact on this. Because we are learning a secondary problem on the output of the original learning algorithm, the most damage we can do is shift the decision surface too far in one direction or the other (ie: add a bias intercept so we make the boundary at 25 instead of 0).

For Platt's algorithm, even if we use the naive fitting scheme, the decision surface is going to be biased towards staying where it is, with the 1 and -1 support vectors equidistant on either side of the surface. This, in fact, gives an opportunity for the calibration to actually fix mistakes made by the underlying algorithm with regards to where the decision cut off should be. This is where the improvements in classification accuracy in Platt's paper come from(even if we do stupid things like over-fit the calibration).

As an example, here is the same data set with naive calibration with bad configuration settings for the SVM.

Fitting SVM SVM + Platt SVM + Isotonic

Again we can see Platt and Isotonic are over-fitting a bit, but we can see they are both better than the initial SVM surface.

Despite Platt's suggestion in his paper, I actually prefer starting with the naive method and switching to cross validation only once I feel I need it. Part of this is speed. If you are using a classifier wrapped in GridSearch, then you have to do cross validation with grid search 3 times (to get the calibration data) before learning the final classifier with another round of grid search to learn the final classifier (otherwise there is leaked information in the parameters selected by the Grid Search). The other reason being, as I mentioned above, the nature of the damdage you can do is fixed. If accuracy drops suddenly afterwards, you know exactly what happend - and can then switch to the more robust methods.  Often enough, the naive method works just fine.

However, it is important to remember that the underlying algorithm has never changed. If you try to perform probability calibration on a noisy data set using SVM, you may get an even worse result simply because SVMs are not good at learning the decision surfaces of noisy data sets.

The other benefit is that I can use the BinaryScoreClassifier in the OneVsAll meta classifier to improve generalization. Before, it simply took the votes for each class and normalized - but there would be cases where no class would get any vote, or many classes would all vote: see the example below for linear and RBF SVM.

When a classifier implements the BinaryScoreClassifier, I can resolve disputes by selecting the class that has the highest score - or distance from the margin. This leads to much cleaner decision surfaces (you also get much more milage out of linear classifiers).

I added this code in revision 716, which included editing a large number of linear classifiers to support the new interface. One of the things I may add is a static method for wrapping a classifier that natively supports probability estimates as a BinaryScoreClassifier. This way you can use the probability of class 0 as the score function, and re-calibrate the outputs using something more expressive like Isotonic Regression to increase the accuracy or just get better probability estimates.

1 comment:

  1. Great post. Clearly explains Platt's method, and provides lots of insight into its pros and cons.