I've been reading a recent paper that describes a new Vector Collection, called
Random Ball Cover. Its a bit clever, and aims to answer nearest neighbor queries in \(O(\sqrt{n})\) time, which is unusual. While the paper goes about it in a very worder manner, the purpose is so that the basic steps of the RBC algorithm will be the naive NN algorithm, which is trivial to implement in a very efficient manner.
I'm currently implementing it with some modifications, and I'll be discussing them below. Though I will provide some summary, you'll get a better understanding of what I'm doing by reading the paper first.
RBC is really two different algorithms, an exact version, and an approximate "one-shot" version. The exact version is simple. Take \(\sqrt{n}\) samples from the set of vectors and call them representatives, and then assign each other vector to its nearest representative. Each representative will be expected to have \(O(\sqrt{n})\) points it represents.
For a query \(q\), we find the closest representative \(r_q\). If \(\psi_r\) is the farthest distance from \(r_q\) to any of the points it represents, the following two inequalities can prune out the other representatives.
$$d(q, r) > d(q, r_q) + \psi_r$$
$$d(q, r) > 3 \cdot d(q, r_q)$$
If either of these is true, we can discard \(r\) and all the points it represents from consideration.
The original paper is obsessed with performing all operations in parallel. However, I don't believe a nearest neighbor based system should be used for tasks that need a timely answer. Instead, you can get the parallelism from parallel queries in a batch system. The storage requirements and slower responses from nearest neighbor systems are much better suited to this domain.
Dropping the need to do single queries in parallel, we can use the first equation to prune the individual points once we start looking that those owned by the representative. If \(L_r\) represents those owned by the representative \(r\), let \(\alpha\) be any point \( \in L_r\)
$$d(q, r) > d(q, r_q) + d(r, \alpha)$$
can be used to prune individual points. It is easy to see this is correct, simply assume all points in \(L_r\) that are farther away than \(\alpha\) are no longer in the set. The bound will still hold.
The first bound can also be used to do efficient range queries, but the author neglects this.
The one-shot version of RBC is very similar, and has very tight probabilistic bounds on selecting the correct point. Each representative grabs the first \(s\) points near it, where \(s\) is a new parameter that we can set to \(\sqrt{n}\) to obtain the very high probability of selecting the correct point. We then check \(r_q\) and all \(x \in L_r\) as potential nearest neighbors. No one else is considered, which avoids a lot of extra work.
The downside to the one-shot method is that it makes a poor approximation for range queries. Consider the query that is near the border and has a range that makes it extend into the radius of a second representative It seems we will likely miss points that are on the other side. This becomes more of a problem when you pick \(s << \sqrt{n}\). While there is still a good probability of getting the correct nearest neighbor, it seems much harder to get the correct range query results. Further, because each representative gets its \(s\) closest, the same points can be in multiple representatives. This means some points will be unrepresented.
I think the less impressive behavior of the one-shot version is why the range query isn't even mentioned. But then again, many papers only leave range queries as little more than a foot note. So far my tests haven't shown my unitition to be as bad as I though, but I'm still not quite comfortable using one-shot for such scenarios.