## Sunday, June 23, 2013

### Cosine Similarity Locality Sensitive Hashing

I have been meaning to try implementing and learning more about Locality Sensitive Hashing (LSH) for a while now. The general goal of LSH is to have a hash function where you want collisions, and similar items will colide into the same bucket. That way, you can find similar points by just hashing into a bucket, instead of searching the whole data set.

Form my prior perusing, I found there is a lot of inconsistency in the definition of LSH between papers, and how they define the objective. So I started out with two of the earlier algorithms - one called E2LSH and one based on Random Projections (RP) [in revisions r723 and r725].

The Random Projections one is particularly interesting because, on its own, it actually isn't hashing similar items into the same bucket (at least, not often). It creates a signature and then you can do LSH again on the signature for the hamming distance (which is equivalent to the L1 distance since it is 0 and 1 values only). But if you dont want to do this tiered hashing, RP LSH can still be useful for doing brute force NN in a fast way with approximated values. This is because the number of different bits in the signature is related to the cosine similarity of the original vectors. So I decided to try it out on the 20 News Group data set.

I wrote the code for a simple loader that you can see below. Its very basic (and the file path should be a parameter) but I was being lazy. Its common practice to avoid the header information for the news group data set (so that your algoritm dose not associate the people with the news group, but the content). To do this I used a quick hack of just finding the "Lines:" line of the header. I ripped this code out of a bigger file with a lot of other testing / experiment code, so I didn't bother with the imports - but your IDE should be able to resolve them for you.

/*
* A simple example on how to load a text data set. The 20 News Groups data set is
* designed such that each folder is a different class, and each file in the folder
* contains a document.
*/

{

private File mainDirectory = new File("<your path to the data set>/20_newsgroup/");
private List<File> dirs;

{
super(tokenizer, weighting);
}

@Override
protected void setLabelInfo()
{
File[] list = mainDirectory.listFiles();
dirs = new ArrayList<File>();
for(File file : list)
if(file.isDirectory())

labelInfo = new CategoricalData(dirs.size());
labelInfo.setCategoryName("News Group");
for(int i = 0; i < dirs.size(); i++)
labelInfo.setOptionName(dirs.get(i).getName(), i);
}

@Override
{
for(int i = 0; i < dirs.size(); i++)
{
for(File doc : dirs.get(i).listFiles())
{
try
{
String line;
StringBuilder sb = new StringBuilder();

boolean startConsidering = false;
{
if(!startConsidering)
{
//the format is such that most of the header occurs
//before the "lines:" line, so this way we can skip
//most of the HTML header info without a lot of work.
if(line.trim().startsWith("Lines:"))
startConsidering = true;
continue;
}
sb.append(line).append(" ");
}

br.close();

}
catch (FileNotFoundException ex)
{
ex.printStackTrace();
}
catch(IOException ex)
{
ex.printStackTrace();
}
}
}
}

}



I then just did 10 fold cross validation on 7 nearest neighbors (chosen arbitrarily). I've graphed the results below for the Error rate, Training time, and Query time summed over the 10 folds. There are 19997 data points and 14572 features after I removed all words that occurred less than 10 times.

For the RP LSH, the naive set up requires building a large matrix in memory of random values, which takes up a lot of space! For doubles, using a 512 bit signature means 512 rows of a matrix with 14572 columns and 8 bytes per double brings up a total of just under 60 megabytes. For a 4096 bit signature, that means 460 megabytes. On my MacBook Air, that was the biggest I could do before paging started to occur with everything else I was running.

However, you dont have to explicitly keep this matrix in memory. One way is to generate every random value as needed. In JSAT I have a RandomMatrix and RandomVector classes explicitly for making this kind of code easier. Unfortunately generating random gaussian values is expensive, making that is very slow at runtime unless your data is incredibly sparse - or you only need the values infrequently. The code itself is also a little on the slow side unfortunately, which is something I'm still investigating (if you know of a good and fast hash function for an input of 3 integers, let me know!).

The other alternative is to use a technique called pooling. Essentially you create a pre-defined pool of unit normal values, and then index into the pool when the matrix needs to return a value. This way we can create a small (less than one megabyte) cache of values that we use for all indices. This works surprisingly well for sparse data sets, and is fairly robust to poor random number generation, so I was able to exploit that to make the code a lot faster.

So below is the Error Rate, Training, and Query time for the naive true NN, an in Memory LSH, and two pools of 104 and 105 elements each (80 kb and 780 kb respectively).

As you can see, NN isn't the most accurate method for this data set (other algorithms only get down to around 0.2 in terms of error). However, as we increase the signature size, it gets more an more accurate - and by 8192 (which is a considerable signature), the accuracies are comparable. The pooling method isn't much worse than the in memory version of RP LSH, with the 105 one almost identical in accuracy but using 500 times less memory. This also allow us to do even larger signatures than the in memory method.

Looking at query time, we can see that even with these large signatures, the LSH query time is less than half of the normal brute force method, which is pretty good. While JSAT isn't designed for it, this would be particularly important if the original data set didn't fit into memory, the signatures probably would. We can also see that the pooled versions are slightly faster in terms of query time, mostly due to better locality (IE: we dont have over 400 megabytes of matrix eating all the cache).

In training time, we see that the naive NN doesn't really have any, but LSH needs to construct the random matrix and then compute all the signatures. We also see the pooling method is a good deal faster than the in memory method, and the difference grows with the signature length. This is because the pooled version creates a fixed number of random gaussians independent of signature size, where the in memory one has to generate enough to fill the whole matrix.

Overall, its pretty easy to see how the cosine LSH could be useful. I'm not sure if such large signature lengths are generally needed, or an artifact of this data set in particular - its something to look at in the future though. But even with these long signatures, it is faster then normal brute force in terms of query performance, and the trade off in accuracy and runtime by signature length could be exploited in a lot of ways.

Below is the code for trying the different LSH schemes  just uncomment the one you want (and get your IDE to fill in the imports). You'll need the latest subversion revision (r725) for it to work.
public class NewsGroupsMain
{
public static void main(String[] args)
{
//create the tokenizer and stemmer
Tokenizer token = new NaiveTokenizer();
token = new StemmingTokenizer(new PorterStemmer(), token);

ClassificationDataSet cds = ngl.getDataSet();

//remove all features that occur less than 10 times in the whole data set
cds.applyTransform(ngl.getMinimumOccurrenceDTF(10).getTransform(cds));

CategoricalData pred = cds.getPredicting();

System.out.println("Data Set Size: " + cds.getSampleSize());
System.out.println("Features: " + cds.getNumNumericalVars());

Classifier classifier;

//NN the naive way (default collection used for high dimensional data is brute force)
//classifier = new NearestNeighbour(7, false, new CosineDistance());
//NN for the LSH with an in memory matrix
//classifier = new NearestNeighbour(7, false, new CosineDistance(), new RandomProjectionLSH.RandomProjectionLSHFactory<VecPaired<Vec, Double>>(16, true));
//NN for the LSH with a pool of 10^4
classifier = new NearestNeighbour(7, false, new CosineDistance(), new RandomProjectionLSH.RandomProjectionLSHFactory<VecPaired<Vec, Double>>(16, 1000));

ClassificationModelEvaluation cme = new ClassificationModelEvaluation(classifier, cds);
cme.evaluateCrossValidation(10);
System.out.println("Error Rate: " + cme.getErrorRate());

System.out.println("Training Time: " + cme.getTotalTrainingTime());
System.out.println("Testing Time: " + cme.getTotalClassificationTime());

cme.prettyPrintConfusionMatrix();
}
}


## Thursday, June 13, 2013

### Probability Calibration

One of the things that I've been meaning to add to JSAT for a while now is probability calibrations. Speciafically - when I was first learning about Platt's SMO algorithm a few years ago - I came across his paper for giving SVMs probabilistic outputs.

Platt, J. C. (1999). Probabilistic Outputs for Support Vector Machines and Comparisons to Regularized Likelihood MethodsAdvances in Large Margin Classifiers (pp. 61–74). MIT Press. Retrieved from here

As you can guess by the title, this allows you to calibrate the outputs of the SVM to produce probability estimates. Instead of taking the sign of the SVM's output as the class label, you essentially create a new data set that has the same labels, but with one dimension (the output of the SVM). You then train on this new data set, and feed the output of the SVM as the input to this calibration method, which returns a probability. In Platt's case, we are essentially just performing logistic regression on the output of the SVM with respect to the true class labels.

You'll notice that this description only makes sense for binary classification problems, so I've created a BinaryScoreClassifier interface for algorithms that fit the bill. Then I implemented Platt's calibration scheme and another based on Isotonic Regression, which makes a weaker assumption than Platt's algorithm.

The main thing that needs to be added is support for cross validation in the calibration. The naive thing to do would be to train the model, and get the scores afterwords for each data point. This could lead to overfitting, and as Platt points out in his papper - all the support vectors will get scores of -1 or 1, which would be rare in practice. To combat this, you can do just 3 folds of cross validation - where you train on 2 folds and get the scores for the 3rd fold.

So here is a quick look at the difference between this naive method and using cross validation (CV) to fit the calibrations.
 Fitting SVM SVM + Platt SVM + Isotonic Naive CV * same as above

As you can see, the CV fittings are general better. For Platt, the top left area is getting a bit too faded for the naive method, and the Isotonic hasn't really learned any probability range.

However, overfitting is not as big an issue with these calibration schemes as one might think. The decision surface itself is learned by the underlying algorithm, and our calibration has no impact on this. Because we are learning a secondary problem on the output of the original learning algorithm, the most damage we can do is shift the decision surface too far in one direction or the other (ie: add a bias intercept so we make the boundary at 25 instead of 0).

For Platt's algorithm, even if we use the naive fitting scheme, the decision surface is going to be biased towards staying where it is, with the 1 and -1 support vectors equidistant on either side of the surface. This, in fact, gives an opportunity for the calibration to actually fix mistakes made by the underlying algorithm with regards to where the decision cut off should be. This is where the improvements in classification accuracy in Platt's paper come from(even if we do stupid things like over-fit the calibration).

As an example, here is the same data set with naive calibration with bad configuration settings for the SVM.

 Fitting SVM SVM + Platt SVM + Isotonic Naive

Again we can see Platt and Isotonic are over-fitting a bit, but we can see they are both better than the initial SVM surface.

Despite Platt's suggestion in his paper, I actually prefer starting with the naive method and switching to cross validation only once I feel I need it. Part of this is speed. If you are using a classifier wrapped in GridSearch, then you have to do cross validation with grid search 3 times (to get the calibration data) before learning the final classifier with another round of grid search to learn the final classifier (otherwise there is leaked information in the parameters selected by the Grid Search). The other reason being, as I mentioned above, the nature of the damdage you can do is fixed. If accuracy drops suddenly afterwards, you know exactly what happend - and can then switch to the more robust methods.  Often enough, the naive method works just fine.

However, it is important to remember that the underlying algorithm has never changed. If you try to perform probability calibration on a noisy data set using SVM, you may get an even worse result simply because SVMs are not good at learning the decision surfaces of noisy data sets.

The other benefit is that I can use the BinaryScoreClassifier in the OneVsAll meta classifier to improve generalization. Before, it simply took the votes for each class and normalized - but there would be cases where no class would get any vote, or many classes would all vote: see the example below for linear and RBF SVM.

When a classifier implements the BinaryScoreClassifier, I can resolve disputes by selecting the class that has the highest score - or distance from the margin. This leads to much cleaner decision surfaces (you also get much more milage out of linear classifiers).

I added this code in revision 716, which included editing a large number of linear classifiers to support the new interface. One of the things I may add is a static method for wrapping a classifier that natively supports probability estimates as a BinaryScoreClassifier. This way you can use the probability of class 0 as the score function, and re-calibrate the outputs using something more expressive like Isotonic Regression to increase the accuracy or just get better probability estimates.