Friday, May 3, 2013

Fast Euclidean Distances

Every now and then, I always see one particular optimization for computing the euclidean distance between a query and a fixed set of points. It boils down to taking the square of the norm,

$$d(x, y)^2 = ||x-y||^2 = x \cdot x + y \cdot y - 2 x \cdot y$$

or if the squared term makes you sad

$$d(x, y) = \sqrt{ x \cdot x + y \cdot y - 2 x \cdot y}$$

And that is it. Its very simple and allows us to express everything as dot products (you may also notice this means any kernel trick can be expressed as a valid distance metric). I've often seen it mentioned as a performance improvement, but I've never seen it mentioned alone - its always with 5 or 10 other things.So I was wondering, how much of the performance comes from this trick such that it became the one I always see?

On its own it gets you no speed though, its only helpful if the same set of vectors are going to be compared against. That way you can cache the \(y \cdot y\) value for each vector. \(x \cdot x\) only has to be done once for the query vector, and can then be reused. Finally, the only thing left is \( 2 x \cdot y\) for each point, thats just 1 dot product per point instead of 3! (thats 3, not 3 factorial!)

I did some quick timing results to compare it against the naive method on a few data sets in a nearest neighbor search scheme. Time is in seconds, and measures how long it took to find the nearest neighbor for each testing point summed over 10 cross validation folds.

Data Set waveform-5000 ionosphere breast-wisconsin diabetes spambase
Naive Euclid Time 10.117  0.109 0.122 0.141 10.743
Fast Euclid Time 1.972 0.057 0.059 0.062 2.186
Speed Up 5.13
1.91
2.072.274.91

As you can see, its pretty good gain all on its own to use this trick. The tree smaller data sets (ionosphere, breast, diabetes) all got about 2 times faster, which I expect is mostly an avoidance of unnecessary squaring. With this method we get to do only multiplications until one square root at the end. Wave form and spambase are both larger with a lot of features, and I'm actually a little surprised that the speed up wasn't even larger for the spambase data set. Unlike waveform, spambase is sparse - so I was expecting it to get an even bigger boost by avoiding memory locality issues (cached dot products should always hit cache, rest is just a sparse dot product).

Overall, a good trick to use. I just recently added it to JSAT as its own vector collection (r691). What I need to do now is think of a good way to add it to things like VPTrees and RandomCoverBalls without duplicating code for each algorithm. Just another thing on my TODO list!

No comments:

Post a Comment