Tuesday, May 28, 2013

Distance Metric Acceleration for all


I recently made a post about performing faster Euclidean distance computations. I was working on some similar code for accelerating kernel computations for SVMs, and realized I could adapt the same code to distance computations! I just committed that change (r704), and despite it modifying a large number of files, it actually doesn't take much to exploit this ability without any silly code duplication.

The old way might look something like this (a quick made-up example):

DistanceMetric dm = new EuclideanDistance();
List<Vec> X = dataSet.getDataVectors();
double allPairAvg = 0;
for(int i = 0; i < X.size(); i++)
    for(int j = i+1; j < X.size(); j++)
        allPairAvg += dm.dist(X.get(i), X.get(j));

allPairAvg /= X.size() * (X.size() - 1.0) / 2;

This computes the average distance between all pairs of points. The code above will still work in JSAT, but you can now write it as follows:

DistanceMetric dm = new EuclideanDistance();
List<Vec> X = dataSet.getDataVectors();
List<Double> distCache = dm.getAccelerationCache(X);
double allPairAvg = 0;
for(int i = 0; i < X.size(); i++)
    for(int j = i+1; j < X.size(); j++)
        allPairAvg += dm.dist(i, j, X, distCache);

allPairAvg /= X.size() * (X.size() - 1.0) / 2;

As you can see, there is not a huge change in the code. What happens is that distCache holds pre-computed information about each of the vectors - in the Euclidean case, each vector's self dot product. This information is then used in the method call to accelerate the computation of distances. What if the metric does not support the acceleration calls? By the interface contract, getAccelerationCache will return null, and when the dist method is called it must check whether distCache is null. If it is, the method uses the list of vectors to compute the distance the normal way.

This makes it just as fast as before when acceleration isn't supported, as the distance call for unsupported metrics makes the same two dereferences as the first version. distCache will be null, so it's just an extra null reference on the stack. Nothing big.

When the acceleration is supported, distCache uses the DoubleList class in JSAT, which is essentially just a normal array of doubles wrapped by another object, so the memory overhead is very small.
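
To make this concrete, here is a minimal sketch of what the accelerated path could look like for the Euclidean case - illustrative code, not JSAT's exact source:

//Hypothetical sketch of the accelerated Euclidean distance - not JSAT's exact source.
//The cache holds each vector's self dot product, so each distance call needs only
//one new dot product instead of three.
public double dist(int i, int j, List<? extends Vec> vecs, List<Double> cache)
{
    if(cache == null) //metric does not support acceleration
        return dist(vecs.get(i), vecs.get(j)); //compute the normal way
    //d(x, y)^2 = x.x + y.y - 2 x.y, using the cached self dot products
    double d2 = cache.get(i) + cache.get(j) - 2*vecs.get(i).dot(vecs.get(j));
    return Math.sqrt(Math.max(d2, 0)); //max guards against tiny negative rounding errors
}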

I did have to strike a bit of a balance in the interface. The one important case is when you have a vector y that is not in the original data set, and you want to compute the distance from it to many of the points in the data set. Some use cases might need just a single distance, some might compute all the distances but only need the minimum, and others might only compare against \(O(\log n)\) of the original vectors. So it needed to be decently versatile, and my solution ended up looking like this:

DistanceMetric dm = new EuclideanDistance();
List<Vec> X = dataSet.getDataVectors();
List<Double> distCache = dm.getAccelerationCache(X);
Vec y = ... //some vector from somewhere
List<Double> qi = dm.getQueryInfo(y);
double yDistAvg = 0;
for(int i = 0; i < X.size(); i++)
    yDistAvg += dm.dist(i, y, qi, X, distCache);

yDistAvg /= X.size();

In this case the call to getQueryInfo(y) gets y its own pre-computed information, as if it were part of the original collection of vectors. We then simply provide that information when we do the distance computation. Once again, if the metric does not support acceleration, getQueryInfo will return null, and the method computes the distance the normal way when distCache is null. In this way the same code works in all cases, and you don't have to write any special cases or branching yourself. The branching done behind the scenes is consistent though, which is very easy on the CPU branch predictor, and even lets the JIT eliminate it altogether.
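
The query form could then be implemented along these lines - again an illustrative sketch, not JSAT's exact source:

//Hypothetical sketch of the query form - not JSAT's exact source.
public double dist(int i, Vec y, List<Double> qi, List<? extends Vec> vecs, List<Double> cache)
{
    if(cache == null) //acceleration unsupported, qi will be null as well
        return dist(vecs.get(i), y); //compute the normal way
    //cache.get(i) holds x_i . x_i, and qi.get(0) holds y . y
    double d2 = cache.get(i) + qi.get(0) - 2*vecs.get(i).dot(y);
    return Math.sqrt(Math.max(d2, 0));
}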

There is a little overhead in using a DoubleList, which is a bit much given that it is likely to hold only one or two values, but it's incredibly small relative to everything else going on - so I'm not worried about it.

To test it out, I re-ran some of the code from my previous experiment in seeing how fast k-means is. I've since implemented Hamerly's algorithm, and re-worked Elkan's to be even more efficient. The two left timing columns are without acceleration, and the right three columns are with acceleration, including the naive algorithm with acceleration.

Data Set                    k    Elkan        Hamerly      Elkan w/ Cache  Hamerly w/ Cache  Naive w/ Cache
covtype (n=581,012, d=54)   7    18.679 s     39.319 s     7.259 s         3.845 s           14.309 s
covtype (n=581,012, d=54)   70   11 min 48 s  35 min 16 s  10 min 44 s     1 min 29 s        12 min 23 s
MNIST (n=60,000, d=780)     10   51.874 s     2 min 10 s   2.57 s          3.365 s           10.765 s
MNIST (n=60,000, d=780)     100  4 min 12 s   34 min 37 s  22.069 s        53.70 s           2 min 16 s

The point of Elkan's and Hamerly's algorithms is to avoid distance calculations, while the cache acceleration instead reduces the cost of each calculation. This makes the naive algorithm surprisingly fast - much closer than it was before (I didn't feel like waiting hours for the unaccelerated version to run again). While no longer orders of magnitude faster, Elkan's and Hamerly's are still 2-10 times faster (which isn't bad!). An interesting case is Elkan's on the covtype data set for \(k = 70\). Amusingly, Elkan's avoids so many distance computations in that case (and they have become so cheap) that the bookkeeping became the most expensive part. This doesn't usually happen, but it's an interesting case where Hamerly's becomes more efficient.

Overall, this code gives a huge speed improvement in a lot of cases. And now that the code is inside the Vector Collections, almost all the rest of JSAT and anyone who uses a VC will automatically get these speed boosts when supported.

Monday, May 6, 2013

Approximate Maximal Margin Classification Algorithm

I've recently been adding a lot of online classifiers to JSAT, and was working on the Approximate Maximal Margin Classification Algorithm (ALMA\(_p\)):

Gentile, C. (2002). A New Approximate Maximal Margin Classification Algorithm. The Journal of Machine Learning Research, 2, 213–242.

The author describes the algorithm once in very general terms, and then refines the specification as the paper goes on, making it a little confusing to read (at least to me). At first glance it appears that ALMA has 5 parameters to set (\(p\), \(q\), \(\alpha\), \(B\), and \(C\))! But this is not the case. First are \(B\) and \(C\), which are set to default values that guarantee convergence (\(B = \frac{1}{\alpha}\) and \(C = \sqrt{2}\)).
There is also the scary mapping function, which makes computation difficult and slow (one reason why SMIDAS in JSAT does not work well for large problems) - but we can fix the \(p\) value to whatever we like, and \(p = 2\) turns the mapping function into the identity, which allows us to do simple updates. Fixing \(p = 2\) also fixes \(q = 2\), because \(q\) is the dual exponent of \(p\). This leaves only \(\alpha\), which can be tested with just the few values given in the paper.
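
Spelled out, the dual relationship is

$$ \frac{1}{p} + \frac{1}{q} = 1 \implies q = \frac{p}{p-1}, \quad \text{so } p = 2 \text{ gives } q = 2 $$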

The other bit of weirdness in the paper (I assume the author realized it for his implementation, but it is never mentioned) is that the averaged output used comes for (almost) free in the kernel case.

$$ output_i(x) = \left(\sum_{k=1}^{m^{(i)}+1} c_k^{(i)} w_k^{(i)} \right) \cdot x $$

which can be just as easily written as

$$ output_i(x) = \sum_{k=1}^{m^{(i)}+1} c_k^{(i)} \left( w_k^{(i)}  \cdot x \right) $$

This second form makes far more sense when implementing the averaged output of the kernelized version, because it means we don't have to figure out how to average in the kernel space.

Then in Remark 5, we get the recursive definition of the current hyperplane's output (and another recursive definition to get us \(N_{k+1}\)):

$$ w_{k+1} \cdot x = \frac{w_k \cdot x + \eta_k y_t \hat{x}_t \cdot x}{N_{k+1}} $$

The author does mention that this gets us the kernel form, since it now expresses the hyperplane in terms of dot products, but does not mention that it also gets us the averaged (or voted) output result for free! We simply have to tally the averaged result as we compute every \(w_k \cdot x\) in the recurrence relation, saving us from having to keep any copies (in the linear case you have to keep an extra \({w_k}_{avg}\) around).

The paper defines \(\hat{x}_t = \frac{x_t}{\sqrt{x_t \cdot x_t}}\), which gives the wrong impression that you would need to store a modified copy of the input vector - instead you can write

$$ w_{k+1} \cdot x = \frac{w_k \cdot x + \eta_k y_t \frac{1}{\sqrt{x_t \cdot x_t}} x_t \cdot x}{N_{k+1}} $$

and then cache the self dot products for the future.
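
Putting the pieces together, a sketch of evaluating the kernelized averaged output might look like the following. This is illustrative only - the bookkeeping lists (etas, labels, normalizers, counts, selfDots) are my assumptions about what an implementation would store per update, not JSAT's actual fields:

//Hypothetical sketch - not JSAT's actual ALMA2 implementation
//supportVecs.get(k) is the vector from update k; etas, labels, and normalizers
//hold eta_k, y_k, and N_{k+1}; counts holds c_k; selfDots caches x_k . x_k
public double averagedOutput(Vec x, KernelTrick kernel, List<Vec> supportVecs,
        List<Double> etas, List<Double> labels, List<Double> normalizers,
        List<Double> counts, List<Double> selfDots)
{
    double wDotX = 0; //w_k . x, built up by the recurrence from Remark 5
    double avgOutput = 0; //running tally of the averaged output sum_k c_k (w_k . x)
    for(int k = 0; k < supportVecs.size(); k++)
    {
        double kEval = kernel.eval(supportVecs.get(k), x); //x_k . x in kernel space
        double normalized = kEval/Math.sqrt(selfDots.get(k)); //hat{x}_k . x
        wDotX = (wDotX + etas.get(k)*labels.get(k)*normalized)/normalizers.get(k);
        avgOutput += counts.get(k)*wDotX; //tally each hyperplane's contribution as we go
    }
    return avgOutput;
}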

Overall it's a good paper and a nice algorithm, though I'm not sure if I want to implement the generalized \(p\) version for the linear case. You can find the linear and kernelized versions of ALMA2 in JSAT since revision 692.

Friday, May 3, 2013

Fast Euclidean Distances

Every now and then I see one particular optimization for computing the Euclidean distance between a query and a fixed set of points. It boils down to expanding the square of the norm,

$$d(x, y)^2 = ||x-y||^2 = x \cdot x + y \cdot y - 2 x \cdot y$$

or if the squared term makes you sad

$$d(x, y) = \sqrt{ x \cdot x + y \cdot y - 2 x \cdot y}$$
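
The identity is nothing more than expanding the squared norm:

$$ ||x-y||^2 = (x-y) \cdot (x-y) = x \cdot x - 2 x \cdot y + y \cdot y $$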

And that is it. It's very simple, and it allows us to express everything as dot products (you may also notice this means any kernel trick can be used to define a valid distance metric). I've often seen it mentioned as a performance improvement, but I've never seen it mentioned alone - it's always bundled with 5 or 10 other things. So I was wondering: how much of the performance comes from this trick, such that it became the one I always see?

On its own this gets you no speed, though - it's only helpful if the same set of vectors will be compared against repeatedly. That way you can cache the \(y \cdot y\) value for each vector. \(x \cdot x\) only has to be computed once for the query vector, and can then be reused. Finally, the only thing left is \(2 x \cdot y\) for each point - that's just 1 dot product per point instead of 3! (That's 3, not 3 factorial!)
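
Here is a minimal sketch of the trick in a nearest neighbor search - illustrative code, not the vector collection JSAT actually uses:

//Hypothetical sketch of the cached Euclidean trick - not JSAT's actual code
//Pre-compute the self dot product y . y once for every vector in the fixed set
double[] selfDots(List<Vec> X)
{
    double[] cache = new double[X.size()];
    for(int i = 0; i < X.size(); i++)
        cache[i] = X.get(i).dot(X.get(i));
    return cache;
}

//Each query then costs only one dot product per point in X
int nearestNeighbor(Vec x, List<Vec> X, double[] cache)
{
    double xx = x.dot(x); //computed once, reused for every comparison
    int best = -1;
    double bestDist2 = Double.POSITIVE_INFINITY;
    for(int i = 0; i < X.size(); i++)
    {
        //squared distance x.x + y.y - 2 x.y; the sqrt isn't needed for comparisons
        double d2 = xx + cache[i] - 2*x.dot(X.get(i));
        if(d2 < bestDist2)
        {
            bestDist2 = d2;
            best = i;
        }
    }
    return best;
}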

I did some quick timing results to compare it against the naive method on a few data sets in a nearest neighbor search scheme. Time is in seconds, and measures how long it took to find the nearest neighbor for each testing point summed over 10 cross validation folds.

Data Set           waveform-5000  ionosphere  breast-wisconsin  diabetes  spambase
Naive Euclid Time  10.117         0.109       0.122             0.141     10.743
Fast Euclid Time   1.972          0.057       0.059             0.062     2.186
Speed Up           5.13           1.91        2.07              2.27      4.91

As you can see, it's a pretty good gain all on its own to use this trick. The three smaller data sets (ionosphere, breast, diabetes) all got about 2 times faster, which I expect is mostly from avoiding unnecessary squaring - with this method we only do multiplications until one square root at the end. Waveform and spambase are both larger with a lot of features, and I'm actually a little surprised that the speed up wasn't even larger for the spambase data set. Unlike waveform, spambase is sparse - so I was expecting it to get an even bigger boost by avoiding memory locality issues (cached dot products should always hit cache, and the rest is just a sparse dot product).

Overall, a good trick to use. I just recently added it to JSAT as its own vector collection (r691). What I need to do now is think of a good way to add it to things like VP Trees and Random Ball Covers without duplicating code for each algorithm. Just another thing on my TODO list!