$$d(x, y)^2 = ||x-y||^2 = x \cdot x + y \cdot y - 2 x \cdot y$$
or if the squared term makes you sad
$$d(x, y) = \sqrt{ x \cdot x + y \cdot y - 2 x \cdot y}$$
And that is it. It's very simple and lets us express everything as dot products (you may also notice this means any kernel \(k\) gives a valid distance in its feature space, since \(d(x,y)^2 = k(x,x) + k(y,y) - 2\, k(x,y)\)). I've often seen it mentioned as a performance improvement, but I've never seen it mentioned alone; it's always bundled with 5 or 10 other things. So I was wondering: how much of the performance comes from this trick on its own, such that it became the one I always see?
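To make the identity concrete, here is a minimal sketch in Java (the class and method names are my own for illustration, not JSAT's API) that computes the squared distance purely from dot products:

```java
public class DotProductDistance
{
    /** Plain dot product between two dense vectors of equal length. */
    public static double dot(double[] a, double[] b)
    {
        double sum = 0;
        for (int i = 0; i < a.length; i++)
            sum += a[i] * b[i];
        return sum;
    }

    /** Squared Euclidean distance via the identity x.x + y.y - 2 x.y */
    public static double sqDist(double[] x, double[] y)
    {
        return dot(x, x) + dot(y, y) - 2 * dot(x, y);
    }
}
```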
On its own it gets you no speed, though; it's only helpful if the same set of vectors is going to be compared against repeatedly. That way you can cache the \(y \cdot y\) value for each stored vector. \(x \cdot x\) only has to be computed once for the query vector, and can then be reused. Finally, the only thing left is \(2 x \cdot y\) for each point, so that's just 1 dot product per point instead of 3! (that's 3, not 3 factorial!)
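Here is a rough sketch of what that caching looks like in a brute-force nearest-neighbor search. This is only an illustration under my own names, not the actual vector collection code that went into JSAT:

```java
import java.util.List;

/**
 * Brute-force nearest-neighbor search that caches the self dot products
 * of the stored vectors, so each query costs one dot product per point.
 */
public class CachedEuclideanSearch
{
    private final List<double[]> points;
    private final double[] selfDots; // cached y.y for every stored vector

    public CachedEuclideanSearch(List<double[]> points)
    {
        this.points = points;
        this.selfDots = new double[points.size()];
        for (int i = 0; i < points.size(); i++)
            selfDots[i] = dot(points.get(i), points.get(i));
    }

    /** Returns the index of the stored vector nearest to the query. */
    public int nearest(double[] query)
    {
        double queryDot = dot(query, query); // x.x computed once per query
        int best = -1;
        double bestDist = Double.POSITIVE_INFINITY;
        for (int i = 0; i < points.size(); i++)
        {
            // one dot product per stored point; squared distance is enough for ranking
            double d = queryDot + selfDots[i] - 2 * dot(query, points.get(i));
            if (d < bestDist)
            {
                bestDist = d;
                best = i;
            }
        }
        return best;
    }

    private static double dot(double[] a, double[] b)
    {
        double sum = 0;
        for (int i = 0; i < a.length; i++)
            sum += a[i] * b[i];
        return sum;
    }
}
```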
I ran some quick timing tests to compare it against the naive method on a few data sets in a nearest-neighbor search scheme. Time is in seconds, and measures how long it took to find the nearest neighbor for each testing point, summed over 10 cross-validation folds.
Data Set | waveform-5000 | ionosphere | breast-wisconsin | diabetes | spambase |
Naive Euclid Time (s) | 10.117 | 0.109 | 0.122 | 0.141 | 10.743 |
Fast Euclid Time (s) | 1.972 | 0.057 | 0.059 | 0.062 | 2.186 |
Speed Up | 5.13 | 1.91 | 2.07 | 2.27 | 4.91 |
Overall, a good trick to use. I just recently added it to JSAT as its own vector collection (r691). What I need to do now is think of a good way to add it to things like VPTrees and RandomCoverBalls without duplicating code for each algorithm. Just another thing on my TODO list!