## Monday, July 22, 2013

### L1 Regularization with Superfluous Features

I've been working on adding some more Linear algorithms again, with a focus on making sure there was some L1 regularized learners in JSAT. This was the one area JSAT was really lacking in, and there really are not a lot of great algorithms for it. A number of the published papers I attempted implementing didn't even work well outside of a very narrow range.

The two algorithms I did implement that worked well were Binary Bayesian  Regression (BBR) and Sparse Truncated Gradient Descent (STGD). The former is a batch Logistic Regression algorithm and the latter an online algorithm for the squared loss. Both of which support L1 regularization and worked well for me once implemented.

I justed used BBR to create a fun graph. Consider a data set where all features are drawn from independent N(0, 1). Let only 5 of these features be relevant, and restrict yourself to only 50 data points. How do you find the relevant features as you keep adding more irrelevant dimensions? Unregularized learners will quickly degrade to random guessing or worse. L2 regularization is easy enough to do quickly, but still isn't strong enough to learn the right coefficients as the dimensionality increases.

This is where L1 regularization comes into play. While most papers mention the sparsity property of L1, the real power is that L1 gives us theoretical bounds on its performance in the face of irrelevant features. This is incredibly important as we collect more and more data with lots of features, where we don't really know which of the features are useful for our decision.

Its a fun plot to look at, and you can clearly see the L1 prior is helping maintain a reasonable level of performance. This graph is slightly unique in how I made the problem. Most L1 vs L2 graphs like this show the L1 prior doing much better, staying near the original accuracy. Often they construct the irrelevant features as sparse, with a binary 0/1 value with a small probability of being 1. The performance is still good, but its important to emphasis that the L1 prior isn't impervious to random features.

Below is the code to generate the same data, including some more that I didn't include. You can try it out to see how the regularization effects the problem, and even change the problem to the sparse 0/1 version and see how it holds up. Note: that BBR takes a while to converge for very small regularization values ($$\lambda \leq 10^{-4}$$ for me) for the L1 prior.

import java.text.DecimalFormat;
import java.util.Random;
import jsat.classifiers.CategoricalData;
import jsat.classifiers.ClassificationDataSet;
import jsat.classifiers.ClassificationModelEvaluation;
import jsat.classifiers.linear.BBR;
import jsat.linear.DenseVector;
import jsat.linear.Vec;
import jsat.utils.random.XORWOW;
import jsat.regression.LogisticRegression;

/**
*
* @author Edward Raff
*/
public class L1L2Comparison
{

/**
* @param args the command line arguments
*/
public static void main(String[] args)
{

int relevant = 5;
int classSize = 50;
int[] irrelevant = new int[]
{
0, 5, 15, 45, 95, 245, 495,
};

double[] regPenalties = new double[]
{
1e-4, 0.001, 0.01, 0.1, 0.5, 1.0, 5.0, 10.0
};

Random rand = new XORWOW();

double[] coef = new double[relevant];

for(int i = 0; i < relevant; i++)
coef[i] = rand.nextDouble()*10;

DecimalFormat df = new DecimalFormat("#.#########");

for(int i = 0; i < irrelevant.length; i++)
{
int D = irrelevant[i]+relevant;
ClassificationDataSet cds = new ClassificationDataSet(D, new CategoricalData[0], new CategoricalData(2));
for(int k = 0; k < classSize; k++)
{
Vec xP = new DenseVector(D);
for(int j = 0; j < D; j++)
xP.set(j, rand.nextGaussian());
double result = 0;
for(int j = 0; j < relevant; j++)
result += coef[j]*xP.get(j);
if(result > 0)
else
}

System.out.println("\n\nD: " + D);
LogisticRegression lr  = new LogisticRegression();

ClassificationModelEvaluation cmeLr = new ClassificationModelEvaluation(lr, cds);
cmeLr.evaluateCrossValidation(10);

System.out.println("UNIFORM: " + df.format(cmeLr.getErrorRate()));

System.out.print("REG: ");
for(double reg : regPenalties)
System.out.print(df.format(reg) + ", ");
System.out.print("\n L2: ");
for(double reg : regPenalties)
{
BBR bbr = new BBR(reg, 1000, BBR.Prior.GAUSSIAN);
ClassificationModelEvaluation cme = new ClassificationModelEvaluation(bbr, cds);
cme.evaluateCrossValidation(10);
System.out.print(df.format(cme.getErrorRate()) + ", ");
}
System.out.print("\n L1: ");
for(double reg : regPenalties)
{
BBR bbr = new BBR(reg, 1000, BBR.Prior.LAPLACE);
ClassificationModelEvaluation cme = new ClassificationModelEvaluation(bbr, cds);
cme.evaluateCrossValidation(10);
System.out.print(df.format(cme.getErrorRate()) + ", ");
}

}
}
}


## Sunday, June 23, 2013

### Cosine Similarity Locality Sensitive Hashing

I have been meaning to try implementing and learning more about Locality Sensitive Hashing (LSH) for a while now. The general goal of LSH is to have a hash function where you want collisions, and similar items will colide into the same bucket. That way, you can find similar points by just hashing into a bucket, instead of searching the whole data set.

Form my prior perusing, I found there is a lot of inconsistency in the definition of LSH between papers, and how they define the objective. So I started out with two of the earlier algorithms - one called E2LSH and one based on Random Projections (RP) [in revisions r723 and r725].

The Random Projections one is particularly interesting because, on its own, it actually isn't hashing similar items into the same bucket (at least, not often). It creates a signature and then you can do LSH again on the signature for the hamming distance (which is equivalent to the L1 distance since it is 0 and 1 values only). But if you dont want to do this tiered hashing, RP LSH can still be useful for doing brute force NN in a fast way with approximated values. This is because the number of different bits in the signature is related to the cosine similarity of the original vectors. So I decided to try it out on the 20 News Group data set.

I wrote the code for a simple loader that you can see below. Its very basic (and the file path should be a parameter) but I was being lazy. Its common practice to avoid the header information for the news group data set (so that your algoritm dose not associate the people with the news group, but the content). To do this I used a quick hack of just finding the "Lines:" line of the header. I ripped this code out of a bigger file with a lot of other testing / experiment code, so I didn't bother with the imports - but your IDE should be able to resolve them for you.

/*
* A simple example on how to load a text data set. The 20 News Groups data set is
* designed such that each folder is a different class, and each file in the folder
* contains a document.
*/

{

private File mainDirectory = new File("<your path to the data set>/20_newsgroup/");
private List<File> dirs;

{
super(tokenizer, weighting);
}

@Override
protected void setLabelInfo()
{
File[] list = mainDirectory.listFiles();
dirs = new ArrayList<File>();
for(File file : list)
if(file.isDirectory())

labelInfo = new CategoricalData(dirs.size());
labelInfo.setCategoryName("News Group");
for(int i = 0; i < dirs.size(); i++)
labelInfo.setOptionName(dirs.get(i).getName(), i);
}

@Override
{
for(int i = 0; i < dirs.size(); i++)
{
for(File doc : dirs.get(i).listFiles())
{
try
{
String line;
StringBuilder sb = new StringBuilder();

boolean startConsidering = false;
{
if(!startConsidering)
{
//the format is such that most of the header occurs
//before the "lines:" line, so this way we can skip
//most of the HTML header info without a lot of work.
if(line.trim().startsWith("Lines:"))
startConsidering = true;
continue;
}
sb.append(line).append(" ");
}

br.close();

}
catch (FileNotFoundException ex)
{
ex.printStackTrace();
}
catch(IOException ex)
{
ex.printStackTrace();
}
}
}
}

}



I then just did 10 fold cross validation on 7 nearest neighbors (chosen arbitrarily). I've graphed the results below for the Error rate, Training time, and Query time summed over the 10 folds. There are 19997 data points and 14572 features after I removed all words that occurred less than 10 times.

For the RP LSH, the naive set up requires building a large matrix in memory of random values, which takes up a lot of space! For doubles, using a 512 bit signature means 512 rows of a matrix with 14572 columns and 8 bytes per double brings up a total of just under 60 megabytes. For a 4096 bit signature, that means 460 megabytes. On my MacBook Air, that was the biggest I could do before paging started to occur with everything else I was running.

However, you dont have to explicitly keep this matrix in memory. One way is to generate every random value as needed. In JSAT I have a RandomMatrix and RandomVector classes explicitly for making this kind of code easier. Unfortunately generating random gaussian values is expensive, making that is very slow at runtime unless your data is incredibly sparse - or you only need the values infrequently. The code itself is also a little on the slow side unfortunately, which is something I'm still investigating (if you know of a good and fast hash function for an input of 3 integers, let me know!).

The other alternative is to use a technique called pooling. Essentially you create a pre-defined pool of unit normal values, and then index into the pool when the matrix needs to return a value. This way we can create a small (less than one megabyte) cache of values that we use for all indices. This works surprisingly well for sparse data sets, and is fairly robust to poor random number generation, so I was able to exploit that to make the code a lot faster.

So below is the Error Rate, Training, and Query time for the naive true NN, an in Memory LSH, and two pools of 104 and 105 elements each (80 kb and 780 kb respectively).

As you can see, NN isn't the most accurate method for this data set (other algorithms only get down to around 0.2 in terms of error). However, as we increase the signature size, it gets more an more accurate - and by 8192 (which is a considerable signature), the accuracies are comparable. The pooling method isn't much worse than the in memory version of RP LSH, with the 105 one almost identical in accuracy but using 500 times less memory. This also allow us to do even larger signatures than the in memory method.

Looking at query time, we can see that even with these large signatures, the LSH query time is less than half of the normal brute force method, which is pretty good. While JSAT isn't designed for it, this would be particularly important if the original data set didn't fit into memory, the signatures probably would. We can also see that the pooled versions are slightly faster in terms of query time, mostly due to better locality (IE: we dont have over 400 megabytes of matrix eating all the cache).

In training time, we see that the naive NN doesn't really have any, but LSH needs to construct the random matrix and then compute all the signatures. We also see the pooling method is a good deal faster than the in memory method, and the difference grows with the signature length. This is because the pooled version creates a fixed number of random gaussians independent of signature size, where the in memory one has to generate enough to fill the whole matrix.

Overall, its pretty easy to see how the cosine LSH could be useful. I'm not sure if such large signature lengths are generally needed, or an artifact of this data set in particular - its something to look at in the future though. But even with these long signatures, it is faster then normal brute force in terms of query performance, and the trade off in accuracy and runtime by signature length could be exploited in a lot of ways.

Below is the code for trying the different LSH schemes  just uncomment the one you want (and get your IDE to fill in the imports). You'll need the latest subversion revision (r725) for it to work.
public class NewsGroupsMain
{
public static void main(String[] args)
{
//create the tokenizer and stemmer
Tokenizer token = new NaiveTokenizer();
token = new StemmingTokenizer(new PorterStemmer(), token);

ClassificationDataSet cds = ngl.getDataSet();

//remove all features that occur less than 10 times in the whole data set
cds.applyTransform(ngl.getMinimumOccurrenceDTF(10).getTransform(cds));

CategoricalData pred = cds.getPredicting();

System.out.println("Data Set Size: " + cds.getSampleSize());
System.out.println("Features: " + cds.getNumNumericalVars());

Classifier classifier;

//NN the naive way (default collection used for high dimensional data is brute force)
//classifier = new NearestNeighbour(7, false, new CosineDistance());
//NN for the LSH with an in memory matrix
//classifier = new NearestNeighbour(7, false, new CosineDistance(), new RandomProjectionLSH.RandomProjectionLSHFactory<VecPaired<Vec, Double>>(16, true));
//NN for the LSH with a pool of 10^4
classifier = new NearestNeighbour(7, false, new CosineDistance(), new RandomProjectionLSH.RandomProjectionLSHFactory<VecPaired<Vec, Double>>(16, 1000));

ClassificationModelEvaluation cme = new ClassificationModelEvaluation(classifier, cds);
cme.evaluateCrossValidation(10);
System.out.println("Error Rate: " + cme.getErrorRate());

System.out.println("Training Time: " + cme.getTotalTrainingTime());
System.out.println("Testing Time: " + cme.getTotalClassificationTime());

cme.prettyPrintConfusionMatrix();
}
}


## Thursday, June 13, 2013

### Probability Calibration

One of the things that I've been meaning to add to JSAT for a while now is probability calibrations. Speciafically - when I was first learning about Platt's SMO algorithm a few years ago - I came across his paper for giving SVMs probabilistic outputs.

Platt, J. C. (1999). Probabilistic Outputs for Support Vector Machines and Comparisons to Regularized Likelihood MethodsAdvances in Large Margin Classifiers (pp. 61–74). MIT Press. Retrieved from here

As you can guess by the title, this allows you to calibrate the outputs of the SVM to produce probability estimates. Instead of taking the sign of the SVM's output as the class label, you essentially create a new data set that has the same labels, but with one dimension (the output of the SVM). You then train on this new data set, and feed the output of the SVM as the input to this calibration method, which returns a probability. In Platt's case, we are essentially just performing logistic regression on the output of the SVM with respect to the true class labels.

You'll notice that this description only makes sense for binary classification problems, so I've created a BinaryScoreClassifier interface for algorithms that fit the bill. Then I implemented Platt's calibration scheme and another based on Isotonic Regression, which makes a weaker assumption than Platt's algorithm.

The main thing that needs to be added is support for cross validation in the calibration. The naive thing to do would be to train the model, and get the scores afterwords for each data point. This could lead to overfitting, and as Platt points out in his papper - all the support vectors will get scores of -1 or 1, which would be rare in practice. To combat this, you can do just 3 folds of cross validation - where you train on 2 folds and get the scores for the 3rd fold.

So here is a quick look at the difference between this naive method and using cross validation (CV) to fit the calibrations.
 Fitting SVM SVM + Platt SVM + Isotonic Naive CV * same as above

As you can see, the CV fittings are general better. For Platt, the top left area is getting a bit too faded for the naive method, and the Isotonic hasn't really learned any probability range.

However, overfitting is not as big an issue with these calibration schemes as one might think. The decision surface itself is learned by the underlying algorithm, and our calibration has no impact on this. Because we are learning a secondary problem on the output of the original learning algorithm, the most damage we can do is shift the decision surface too far in one direction or the other (ie: add a bias intercept so we make the boundary at 25 instead of 0).

For Platt's algorithm, even if we use the naive fitting scheme, the decision surface is going to be biased towards staying where it is, with the 1 and -1 support vectors equidistant on either side of the surface. This, in fact, gives an opportunity for the calibration to actually fix mistakes made by the underlying algorithm with regards to where the decision cut off should be. This is where the improvements in classification accuracy in Platt's paper come from(even if we do stupid things like over-fit the calibration).

As an example, here is the same data set with naive calibration with bad configuration settings for the SVM.

 Fitting SVM SVM + Platt SVM + Isotonic Naive

Again we can see Platt and Isotonic are over-fitting a bit, but we can see they are both better than the initial SVM surface.

Despite Platt's suggestion in his paper, I actually prefer starting with the naive method and switching to cross validation only once I feel I need it. Part of this is speed. If you are using a classifier wrapped in GridSearch, then you have to do cross validation with grid search 3 times (to get the calibration data) before learning the final classifier with another round of grid search to learn the final classifier (otherwise there is leaked information in the parameters selected by the Grid Search). The other reason being, as I mentioned above, the nature of the damdage you can do is fixed. If accuracy drops suddenly afterwards, you know exactly what happend - and can then switch to the more robust methods.  Often enough, the naive method works just fine.

However, it is important to remember that the underlying algorithm has never changed. If you try to perform probability calibration on a noisy data set using SVM, you may get an even worse result simply because SVMs are not good at learning the decision surfaces of noisy data sets.

The other benefit is that I can use the BinaryScoreClassifier in the OneVsAll meta classifier to improve generalization. Before, it simply took the votes for each class and normalized - but there would be cases where no class would get any vote, or many classes would all vote: see the example below for linear and RBF SVM.

When a classifier implements the BinaryScoreClassifier, I can resolve disputes by selecting the class that has the highest score - or distance from the margin. This leads to much cleaner decision surfaces (you also get much more milage out of linear classifiers).

I added this code in revision 716, which included editing a large number of linear classifiers to support the new interface. One of the things I may add is a static method for wrapping a classifier that natively supports probability estimates as a BinaryScoreClassifier. This way you can use the probability of class 0 as the score function, and re-calibrate the outputs using something more expressive like Isotonic Regression to increase the accuracy or just get better probability estimates.

## Tuesday, May 28, 2013

### Distance Metric Acceleration for all

I recently made a post about performing faster Euclidean Distance computations. I was working on some similar code for accelerating Kernel computations for SVMs, and realized I could adapt the same code to distance computations! I just committed that change (r704), and despite modifying a large number of files - it actually doesn't take a lot to exploit this ability without any silly code duplication.

The old way might look something like this (a quick made up example):

DistanceMetric dm = new EuclideanDistance();
List<Vec> X = dataSet.getDataVectors();
double allPairAvg = 0;
for(int i = 0; i < X.size(); i++)
for(int j = i+1; j < X.size(); j++)
allPairAvg += dm.dist(X.get(i), X.get(j));

allPairAvg /= Math.pow(X.size(), 2)/2;


Which would compute the average distance between all point pairs. This code above will still work in JSAT, but you can now write the code as follows:

DistanceMetric dm = new EuclideanDistance();
List<Vec> X = dataSet.getDataVectors();
List<Double> distCache = dm.getAccelerationCache(X);
double allPairAvg = 0;
for(int i = 0; i < X.size(); i++)
for(int j = i+1; j < X.size(); j++)
allPairAvg += dm.dist(i, j, X, distCache);

allPairAvg /= Math.pow(X.size(), 2)/2;


As you can see, there is not a huge change in code here. What happens is that the distCache holds pre-computed information about each of the vectors. In this case, the Euclidean Distance holds the self dot product. This is then used in the method call to accelerate the computation of distances. What if the metric does not support the acceleration calls? By the interface definition, getAccelerationCache will return null. And when the method is called, it must check if distCache is null. If so, it then uses the list to compute the distance the normal way.

This makes it just as fast as when not supported, as the distance call for unsupported classes will just make the two dereferences like the first version. distCache will be null, so tis just an extra void reference on the stack. Nothing big.

When the acceleration is supported, distCache uses the DoubleList class in JSAT, which is essentially just a normal array of doubles wrapped by another object, so the memory overhead is very small.

I did have to strike a bit of a balance in the interface. The one importance case is if you have a vector y that is not in the original data set, that you want to compute the distance to many of the points in the data set. Some use cases might just need a single distance, some might need to do all but only need the minimum result, others might only compare agains $$O(\log n)$$ of the original vectors. So it needed to be decently versatile, and my solution ended up looking like this:

DistanceMetric dm = new EuclideanDistance();
List<Vec> X = dataSet.getDataVectors();
List<Double> distCache = dm.getAccelerationCache(X);
Vec y = //some vector from somewhere
List<Double> qi = dm.getQueryInfo(y);
double yDistAvg = 0;
for(int i = 0; i < X.size(); i++)
yDistAvg += dm.dist(i, y, qi, X, distCache);

yDistAvg /= X.size();


In this case line 5 gets its own information pre-computed, as if it was part of the original collection of vectors. We then simply provide that information when we do the distance computation  Once again, if the metric does not support acceleration, getQueryInfo will return null and the method computes the distance the normal way if distCache is null. In this way the same code works in all cases, and you dont have to have any special cases or branching in the code you write. The nature of the branching done behind the scenes is consistent though, and very easy for the CPU branch predictor, and even the JIT to eliminate all together.

There is a little overhead in using a double list is a bitch much since it is likely to have only one or two values stored in it, but its incredibly small relative to everything else going on - so I'm not worried about it.

To test it out, I re-ran some of the code from my previous experiment in seeing how fast k-means is. I've since implemented Hamerly's algorithm, and re-worked Elkan to be even more efficient  The two left values are the two without acceleration, and the right 3 values the speed with accretion  and the naive algorithm with acceleration.

 Data Set k Elkan Hamerly Elkan w/ Cache Hamerly w/ Cache Naive w/ Cache covtype n = 581,012 d = 54 7 18.679 s 39.319 s 7.259 s 3.845 s 14.309 s covtype n = 581,012 d = 54 70 11 min 48 s 35 min 16 s 10 min 44 s 1 min 29 s 12 min 23 s minst n = 60,000 d = 780 10 51.874 s 2 min 10 s 2.57 s 3.365 s 10.765 s minst n = 60,000 d = 780 100 4 min 12 s 34 min 37 s 22.069 s 53.70 s 2 min 16 s

The point of Elkan's and Hamerly's algorithm is to avoid distance calculations, and the cache acceleration reduces that cost. This makes the naive algorithm surprisingly fast, much closer than it was before (I didn't feel like waiting hours for the unaccelerated version to run again). While no-longer orders of magnitudes faster, Elkan and Hamerly are only 2-10 times faster (which isn't bad!). An interesting case is Elkan's on the covtype data set for $$k = 70$$. Amusingly, Elkan avoids so many distance computations in that case (and they have become so cheap), that the bookkeeping was becoming the most expensive part. This doesn't usually happen, but its an interesting case where Hamerly's becomes more efficient.

Overall, this code gives a huge speed improvement in a lot of cases. And now that the code is inside the Vector Collections, almost all the rest of JSAT and anyone who uses a VC will automatically get these speed boosts when supported.

## Monday, May 6, 2013

### Approximate Maximal Margin Classification Algorithm

I've recently been adding a lot of online classifiers to JSAT, and was working on Approximate Maximal Margin Classification Algorithm (ALMAp):

Gentile, C. (2002). A New Approximate Maximal Margin Classification Algorithm. The Journal of Machine Learning Research, 2, 213–242.

The authors describe the algorithm once in very general terms - and then refine their specification as they go. Making it a little confusing to read (at least to me). It appears at first glance that ALMA has 5 parameters to set ($$p$$, $$q$$  $$\alpha$$  $$B$$  and $$C$$) ! But this is not the case. First is $$B$$ and $$C$$  which are set to default values that guarantee convergence ( $$B = \frac{1}{\alpha}$$, and $$C = \sqrt{2}$$ ).
There is also the scary mapping function which makes computation difficult and slow (one reason why SMIDAS in JSAT does not work well for large problems) - but we can fix the $$p$$ value to whatever we like, and $$p = 2$$ results in the map function becoming identity, which allows us to do simple updates. When we fix $$p = 2$$, this also fixes $$q = 2$$ , because $$q$$ is actually dependent upon $$p$$. This leaves only $$\alpha$$, which can be tested with only a few values from the paper.

The other sort of weirdness in the paper (I assume the authors realized it for their implementation, but they never mention it), is that the averaged output they use comes for (almost) free in the kernel case.

$$output_i(x) = \left(\sum_{k=1}^{m^{(i)}+1} c_k^{(i)} w_k^{(i)} \right) \cdot x$$

which can be just as easily written as

$$output_i(x) = \sum_{k=1}^{m^{(i)}+1} c_k^{(i)} \left( w_k^{(i)} \cdot x \right)$$

This second form makes far more sense when implementing the averages output of the kernelized version, because it means we don't have to figure out how to averages the kernel spaces.

Then in Remark 5, we get the recursive definition of the current hyperplane (and another recursive definition to get us $$N_{k+1}$$ )

$$w_{k+1} \cdot x = \frac{w_k \cdot x + \eta_k y_t \hat{x}_t \cdot x}{N_{k+1}}$$

The authors do mention that this gets us the kernel form since it is now expressing the hyperplane in dot products, they don't mention that this gets us the averaged (or voted) output result for free! We simply have to tally the averaged result as we compute every $$w_k \cdot x$$ in the recurrence relation. Thus saving us from having to keep any copies (linear case you have to keep and extra $${w_k}_{avg}$$ around).

The paper defines $$\hat{x}_t = \frac{x_t}{\sqrt{x_t \cdot x_t}}$$, which gives the wrong impression that you would need to store a modified copy of the input vector - instead you can write

$$w_{k+1} \cdot x = \frac{w_k \cdot x + \eta_k y_t \frac{1}{\sqrt{x_t \cdot x_t}} x_t \cdot x}{N_{k+1}}$$

and then cache the self dot products for the future.

Overall its a good paper and nice algorithm, though I'm not sure if I want to implement the generalized $$p$$ version for the linear case. You can find the linear and kernelized version of ALMA2 in JSAT since revision 692.

## Friday, May 3, 2013

### Fast Euclidean Distances

Every now and then, I always see one particular optimization for computing the euclidean distance between a query and a fixed set of points. It boils down to taking the square of the norm,

$$d(x, y)^2 = ||x-y||^2 = x \cdot x + y \cdot y - 2 x \cdot y$$

or if the squared term makes you sad

$$d(x, y) = \sqrt{ x \cdot x + y \cdot y - 2 x \cdot y}$$

And that is it. Its very simple and allows us to express everything as dot products (you may also notice this means any kernel trick can be expressed as a valid distance metric). I've often seen it mentioned as a performance improvement, but I've never seen it mentioned alone - its always with 5 or 10 other things.So I was wondering, how much of the performance comes from this trick such that it became the one I always see?

On its own it gets you no speed though, its only helpful if the same set of vectors are going to be compared against. That way you can cache the $$y \cdot y$$ value for each vector. $$x \cdot x$$ only has to be done once for the query vector, and can then be reused. Finally, the only thing left is $$2 x \cdot y$$ for each point, thats just 1 dot product per point instead of 3! (thats 3, not 3 factorial!)

I did some quick timing results to compare it against the naive method on a few data sets in a nearest neighbor search scheme. Time is in seconds, and measures how long it took to find the nearest neighbor for each testing point summed over 10 cross validation folds.

 Data Set waveform-5000 ionosphere breast-wisconsin diabetes spambase Naive Euclid Time 10.117 0.109 0.122 0.141 10.743 Fast Euclid Time 1.972 0.057 0.059 0.062 2.186 Speed Up 5.13 1.91 2.07 2.27 4.91

As you can see, its pretty good gain all on its own to use this trick. The tree smaller data sets (ionosphere, breast, diabetes) all got about 2 times faster, which I expect is mostly an avoidance of unnecessary squaring. With this method we get to do only multiplications until one square root at the end. Wave form and spambase are both larger with a lot of features, and I'm actually a little surprised that the speed up wasn't even larger for the spambase data set. Unlike waveform, spambase is sparse - so I was expecting it to get an even bigger boost by avoiding memory locality issues (cached dot products should always hit cache, rest is just a sparse dot product).

Overall, a good trick to use. I just recently added it to JSAT as its own vector collection (r691). What I need to do now is think of a good way to add it to things like VPTrees and RandomCoverBalls without duplicating code for each algorithm. Just another thing on my TODO list!

## Tuesday, March 12, 2013

### Neural Networks and Bias

Neural Networks (NN) haven't been sexy for a long time. I'm not talk about deep networks or autoencoders or any of the new very fancy NNs, but classic back projection trained feed forward neural networks. They are still very useful and powerful classifiers, and one of the first classifiers I ever implemented in JSAT. I also implemented it badly, and incorrectly. However, NNs are so powerful it was actually able to learn around the bugs in my code to still get decent results at times.

When I fist implemented them, I just assume that it took tons of neurons in the hidden layer to learn any function, even simple ones - because that was the results I had obtained. I also didn't know as much at the time about machine learning in general, how to think about the complexity of a decision boundary  how to interpret and investigate my results. So that not so great code has been sitting in JSAT for a while, and I've recently re written all the code from scratch. The new implementation is much cleaner and faster, and I figured I would go over a bit about how the weights are set up.

First is how Feed Forward networks are set up, as demonstrated by this beautiful diagram I have created.

This is a neural network with one input layer (the input is a vector of data points, the $$x$$), one hidden layer, and one output neuron. Every circle represents a "neuron". So this network would be predicting a regression problem. For classification problems, we tend to have $$C$$ outputs, one for each output class (or optionally, $$C-1$$ outputs - but I prefer encoding all C). There is one hidden layer with $$k$$ hidden units, and they feed into the output layer. You'll notice the input and hidden layer have a "+1" unit, which is a unit that always returns positive one. For simplicity, I havent put in every connection. In general we make each neuron in one layer connect to each and every neuron in the other layer.

So, in this example, the input to $$h_1 = x_0 \cdot w_{1,0} + x_1 \cdot w_{1,1} + \cdots + x_n \cdot w_{1,n} + 1 \cdot w_{1,bias} = \vec{x} \cdot \vec{w_1}$$
Because each neuron goes through this processes, we can compute each value as a dot product. And this can be represented as a matrix vector operation $$\vec{h} = W \vec{x}$$. But we dont want to manually add a bias term to everything, as that would take a lot of copying. Instead we move the bias weights out and use the "+1" bias term implicitly. So we end up $$\vec{h} = W \vec{x} + \vec{b}$$ where $$\vec{b}$$ contains the values of the bias weights. $$W$$ would be a k by d matrix. One row for each of the output values, and one column for each of the input values. $$\vec{b}$$ would also be of length k, since it needs to add one weight term to each output. The output $$\vec{h}$$ would then have the activation function applied to each of its values before being feed into the next layer.

This bias term was in fact the biggest mistake I made when I first implemented the Neural Network code. Instead of storing the weight terms separately,  I created a new vector with a bias term added - and adjusted the weights accordingly. This added the bias term for the input layer to the first hidden layer, but neglected a biast term from the hidden layer to the output layer. What ended up happening is that I would have to add a lot of extra neurons so that they could collectively learn an even more complicated decision boundary that would stretch out from the origin of the world to wrap around the points that needed to be classified.

I wont go over the rest of the algorithm, as you can find many many descriptions of it - but I wanted to go over how the linear algebra is set up because so many examples just drop the bias terms, and it was enough to confuse me when I first looked at Neural Networks.

So to end, below is a couple of comparisons between the new and old code. Each example is learned with 10 neurons, using the same number of iterations and learning rate for each algorithm. In addition  I also show the results for the old algorithm with 40 neurons to show what went wrong when it looks different from the 10 neuron case. The new code didn't use any of the new features that the old one did not have.

 New with 10 hidden neurons Old with 10 hidden neurons Old with 40 hidden neurons No Difference No Difference

As you can see, the bias makes a huge difference! I also didn't have this tool for visualizing 2 dimensional data sets. Despite the simplicity and naivety of 2D problems, it really allows your to see what is learned and confirm that your algorithm learns boundaries it should be able to handle with ease. You can see in a lot of the examples how the old code managed to learn something would result in a low training and test error, despite being a very bad decision boundary.

The new code is much cleaner and faster then the old code as well. It also includes the following new features:

• Momentum
• Weight Decay
• More Activation Functions
• Mini Batch Size
• Different Weight Initializations
• Learning Rate Decay
Which make a huge amount of difference in what you can learn and how quickly you can learn it. There is still more to add, including multicore support. However, it should all be much easier to add with the new code, including adding an sparse autoencoder on top of the same code.

## Wednesday, February 6, 2013

### Mini Batch K-Means vs Elkan

So there is a really great paper called Web-Scale K-Means Cluster by D. Sculley out of Google, and it introduces the Mini Batch k-Means algorithm. Its very clean and simple, something anyone could implement without even understanding it. However, there are a few very hand wavy parts of the paper that I don't like - so I decided I would investigate them quickly. For this investigation, I used k=100 clusters on the mnist data set. Like last time, I picked the initial means in a deterministic manner.

First, is the issue of parameter selection. All algorithms tend to have this problem, but in Mini Batch not only do we need to figure out how many $$k$$ we need, but also how many iterations to perform, and the batch size. The paper is loose on details, and simply says "many were tired, its robust, here is the value we are using". Which is actually a little clever, because they avoid the iterations issue by just measuring how long it takes to get to a score relative to their benchmark.

When looking at this problem though, its important to note that the update weight for a cluster center is based on how many updates it has received, and is cluster center dependent. So if we do 50 iterations and have 100 clusters, some (well, most) of them wont get updated on each iteration. This is okay because of the per center learning rate. But this also, to a large degree, makes the two variables unnecessary  Instead, we could look at total number of updates performed, where $$updates = iterations \cdot batch \ size$$.

Clearly, updates is a very good measure of progress, there is a lot of variance in the results of mini batch. If I plotted the errors bars, you would see the standard deviation overlapping for all of them (and it would be very ugly and hard to read). So I haven't included them. Instead, it is the mean that was plotted and used for all of these. So does the trade off have any impact on the standard deviation of the results?

Turns out, still not really. Once we hit a batch size equal to the number of clusters, the standard deviation was pretty stable - and starts to plateau very quickly. So in some ways this is a good mark in Mini Batch's favor, you just have to make sure you pick a batch size and iterations that gives enough updates in total. So how many updates should we do? This is closely tied with my other issue, and that is speed.

The paper claims Mini Batch as producing a much better solution than previous approximations, yet still being 2 to 4 orders of magnitude faster than regular k-means. The problem I have wit this, as I alluded to in my last post, is that Loyd's algorithm was their base case. So, below I plotted the time it took to get to solutions of varying quality over time (stopping point was number of iterations).

The plot is fairly similar to what was presented in the original paper, but there are two big differences. One, we can see the different batch sizes (with the same number of iterations) and that they all tend to overlap based on the total number of updates. So again, its fairly resilient against that. However, the exact k-means has shifted over to the left by 1 to 3 orders of magnitude! Turns out Mini Batch is about an order of magnitude faster in getting to a comparable solution as Elkan's algorithm. That is still very impressive but falls far short of the claims in Sculley's paper in my view.

With this, I want to draw attention to the following graph - where we plot the number of iterations against the time it took the algorithm.

The first thing we can see is that all the batch sizes start at approximately the same time. This is because when we only do one iteration, the vast majority of the time is assigning all the data points to the closest final cluster afterwards. We can use this to choose a good initial batch size and iterations. If we make it so that the updates is equal to the number of data points, we do half our time updating centers and half our time assigning centers. This is about where Elkan starts on the plot with 1 iteration. Looking at our graph, that would get us close to the best we can expect from Mini batch. So that would be one not bad heuristic to use.

The other thing to notice, is the plateau of Elkan in the graph. As we get to the very end, Elkan's triangle inequality bounds are pruning out almost all the data set - so only a few points get updated in each iteration, which is pretty cool I think.

## Wednesday, January 23, 2013

### K-Means and Fast K-Means

I've spent most of my time so far talking about classification, but another important task in Machine Learning is Clustering, which is a form of unsupervised learning. The k-Means algorithm is one of the oldest algorithms in this area, and also one of the most used.

k-means is a fairly simple algorithm, and works as follows. $$k$$ is the number of clusters we want, and is specified before hand.
1. By some method, select $$k$$ different points from the data set to be the initial $$k$$ means.
2. For each data point, find the closest mean $$k_i$$ and assign the point to that mean.
3. For each mean, recompute the mean from all the points assigned to it.
4. If any of the points changed clusters, repeat from step 2.
That is it! Despite its simplicity and naivety, the algorithm is used all the time for many different purposes. This particular naive implementation is often known as Lloyd's algorithm.

The main problem with this algorithm is that it can be a bit slow, and a lot of work has gone into implementing faster versions. Some try to get an exact result, and others attempt to get an approximate solution. Of the exact solutions, the one I see most often compared against is an algorithm by Hartigan and Wong from 1979. H&W report their algorithm to be up to twice as fast as the naive implementation, and its one of the most common benchmarks to beat when showing a new approximate or exact algorithm.

This has always bother me though, especially thanks to an algorithm by Elkan in 2003 that computers the exact k-means result by using the triangle inequality to prune out distance computations. In JSAT the naive algorithm is implemented as NaiveKMeans, and K-Means is an implementation of Elkan's algorithm.

Bellow are some performance comparisons between the two, with $$k$$ equal to the number of classes in the data set times 10. The seeds were selected in a deterministic manner, so that the initial points wouldn't add variance to the run time. The Euclidean distance was used for all problems.

Data SetkLloyd Run TimeElkan Run TimeElkan Speed Up
a9a
n = 32,561
d = 123
2
2.868 s
1.206 s
2.38
a9a
n = 32,561
d = 123
20
24.015 s
6.928 s
3.47
covtype
n = 581,012
d = 54
7
10 min 45.696 s
46.005 s
14
covtype
n = 581,012
d = 54
70
8 hrs 13 min 8.75 s
16 min 57.652 s
29
minst
n = 60,000
d = 780
10
15 min 22.244 s
2 min 23.583 s
6.42
minst
n = 60,000
d = 780
100
5 hrs 32 min 6.34 s
8 min 40 s
38

As you can see, Elkan's algorithm is faster - and much better behaved as we increased $$k$$! So why don't more people implement Elkan's? The only two minor downsides to the algorithm are:
• Its harder to implement in parallel, but JSAT gets linear speeds on my machines using a produce/consumer model just fine
• It requires $$O(n \cdot k +k^2)$$ extra memory. However, unless $$k$$ is getting very large, its really not an unbearable amount extra. Given the enormousness time savings and the cheapness of memory, I'd call that a good trade off.
Even still, there is another paper by Greg Hamerly that builds on Elkan's work. It is faster for low dimensional problems, and still very fast for higher dimensional ones. But more importantly, only needs $$O(n)$$ extra memory.

I haven't implemented Hamerly's algorithm yet (its on my giant TODO list). But I would like to see comparisons done against these two algorithms when people publish. In fact, in one of my next posts - I'll be doing this comparison for you!