Tuesday, October 30, 2012

Motivation for the SVM via the Perceptron

Many algorithms in Machine Learning are related to one another. One algorithm in particular, the Perceptron, is related to a lot of them: you can show how it evolves into numerous different algorithms. It's also very closely related to the Support Vector Machine.

What makes SVMs really useful for low dimensional problems is that they can solve the problem in a higher dimensional space without explicitly forming that space, which allows them to work in infinite dimensional spaces! People often find this confusing, and if they aren't on the same page about what that means mathematically, they don't see what it gains us. So I've taken a simple 2D data set and run the Perceptron algorithm on it 8 times. Starting with a polynomial of degree 1 and working up to degree 8, I've explicitly formed the higher dimensional space and computed the results in that space. You can watch the Perceptron slowly gain the ability to model more and more complex boundaries, until it reaches the point where it gets most of the data right. Granted, the decision boundary isn't great - that's why we don't usually use the Perceptron - but hopefully it provides a bit of insight. The One-vs-All method was used since this is a multi-class problem.
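
For the curious, the setup can be reproduced roughly with something like the sketch below. This is not the code used for the figures - the data set, library calls, and parameters are stand-ins - but it shows the idea of explicitly forming the polynomial feature space and training a One-vs-All Perceptron in it.

```python
# Rough sketch of the experiment: explicit polynomial feature maps of
# increasing degree fed to a One-vs-All Perceptron. The data set here is a
# synthetic stand-in, not the 2D set used for the figures below.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import Perceptron
from sklearn.multiclass import OneVsRestClassifier

X, y = make_blobs(n_samples=300, centers=4, random_state=0)  # simple 2D multi-class data

for degree in range(1, 9):
    # Explicitly form the higher dimensional space: all monomials up to `degree`.
    phi = PolynomialFeatures(degree=degree, include_bias=False)
    Z = phi.fit_transform(X)
    clf = OneVsRestClassifier(Perceptron(max_iter=1000, tol=None))
    clf.fit(Z, y)
    # The dimension grows quickly while the boundary gets more flexible.
    print(degree, Z.shape[1], clf.score(Z, y))
```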

[Figures: Perceptron decision boundaries for the explicit polynomial feature expansions, degree 1 through degree 8.]
So why is the SVM so great if the Perceptron can do this as well? Part of the answer is, again, that the SVM can do this without explicitly forming the space like I did for the Perceptron - though the Perceptron can be modified to work the same way. The real advantage is that the SVM finds a boundary that keeps itself half way between the two classes (the maximum margin boundary). You can see the Perceptron struggling with this in the figures above, where its boundary often overlaps onto another class's space.

Friday, October 12, 2012

Random Ball Cover

I've been reading a recent paper that describes a new Vector Collection called Random Ball Cover. It's a bit clever, and aims to answer nearest neighbor queries in \(O(\sqrt{n})\) time, which is unusual. While the paper goes about it in a very wordy manner, the point is that the basic building block of the RBC algorithm is the naive brute-force NN search, which is trivial to implement in a very efficient manner.

I'm currently implementing it with some modifications, and I'll be discussing them below. Though I will provide some summary, you'll get a better understanding of what I'm doing by reading the paper first.

RBC is really two different algorithms: an exact version and an approximate "one-shot" version. The exact version is simple. Take \(\sqrt{n}\) samples from the set of vectors and call them representatives, then assign every other vector to its nearest representative. Each representative can be expected to own \(O(\sqrt{n})\) points.
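
A minimal sketch of that build step (my own code, not the paper's; brute-force distances throughout) looks like this:

```python
import numpy as np

def build_rbc(X, seed=0):
    """Exact-RBC build sketch: sample sqrt(n) representatives at random and
    assign every point to its nearest representative by brute force."""
    rng = np.random.default_rng(seed)
    n = len(X)
    m = int(np.sqrt(n))
    rep_idx = rng.choice(n, size=m, replace=False)               # the representatives
    # Brute-force distances from every point to every representative -- the
    # naive NN computation that RBC is built out of.
    d = np.linalg.norm(X[:, None, :] - X[rep_idx][None, :, :], axis=2)
    owner = d.argmin(axis=1)
    lists = [np.flatnonzero(owner == j) for j in range(m)]       # L_r for each representative
    # psi_r: the farthest distance from each representative to a point it owns.
    radii = np.array([d[lists[j], j].max() if lists[j].size else 0.0 for j in range(m)])
    return rep_idx, lists, radii
```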

For a query \(q\), we find the closest representative \(r_q\). If \(\psi_r\) is the farthest distance from a representative \(r\) to any of the points it owns, the following two inequalities can prune out the other representatives.
$$d(q, r) > d(q, r_q) + \psi_r$$
$$d(q, r) > 3 \cdot d(q, r_q)$$

If either of these is true, we can discard \(r\) and all the points it represents from consideration.
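
Putting the two pruning rules together, an exact query against the structure from the build sketch above might look like this (again my own sketch, not the paper's code):

```python
def rbc_query_exact(q, X, rep_idx, lists, radii):
    """Exact RBC nearest neighbor query sketch using the two bounds above."""
    reps = X[rep_idx]
    d_rep = np.linalg.norm(reps - q, axis=1)
    rq = d_rep.argmin()
    gamma = d_rep[rq]                          # d(q, r_q)
    best_dist, best_idx = gamma, rep_idx[rq]   # r_q itself is a candidate answer
    for j in range(len(rep_idx)):
        # Discard representative j and everything it owns if either bound fires.
        if d_rep[j] > gamma + radii[j] or d_rep[j] > 3.0 * gamma:
            continue
        for i in lists[j]:
            dist = np.linalg.norm(X[i] - q)
            if dist < best_dist:
                best_dist, best_idx = dist, i
    return best_idx, best_dist
```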

The original paper is obsessed with performing all operations in parallel. However, I don't believe a nearest neighbor based system should be used for tasks that need a timely answer; instead, you can get parallelism by answering many queries at once in a batch setting. The storage requirements and slower responses of nearest neighbor systems are much better suited to that domain.

Dropping the need to do single queries in parallel, we can use the first equation to prune individual points once we start looking at those owned by a representative. If \(L_r\) denotes the set of points owned by the representative \(r\), and \(\alpha\) is any point in \(L_r\), then

$$d(q, r) > d(q, r_q) + d(r, \alpha)$$

can be used to prune individual points. To see that this is correct, imagine that all points in \(L_r\) farther from \(r\) than \(\alpha\) were removed from the set; then \(d(r, \alpha)\) plays the role of \(\psi_r\), and the first bound still holds.
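
Concretely, if each \(L_r\) is stored sorted by \(d(r, \alpha)\) (cheap to do at build time), this bound removes a prefix of the sorted list: every \(\alpha\) with \(d(r, \alpha) < d(q, r) - d(q, r_q)\) is guaranteed to be farther from \(q\) than \(r_q\) is. A sketch of that inner scan (the sorted layout and helper are mine, not from the paper):

```python
def scan_owned(q, X, rep, owned_idx, owned_dists, gamma, best_dist, best_idx):
    """Scan one representative's list, skipping points pruned by the bound.

    owned_idx / owned_dists hold L_r and d(r, alpha), sorted ascending by
    d(r, alpha); gamma is d(q, r_q)."""
    d_q_r = np.linalg.norm(X[rep] - q)
    # All alphas with d(r, alpha) < d(q, r) - gamma satisfy the pruning bound,
    # so the scan can start at the first one that does not.
    start = np.searchsorted(owned_dists, d_q_r - gamma)
    for i in owned_idx[start:]:
        dist = np.linalg.norm(X[i] - q)
        if dist < best_dist:
            best_dist, best_idx = dist, i
    return best_dist, best_idx
```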

The first bound can also be used to do efficient range queries, but the author neglects this.
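
One natural reading is that the query radius simply takes the place of \(d(q, r_q)\): a representative with \(d(q, r) > \epsilon + \psi_r\) cannot own any point within \(\epsilon\) of the query. A sketch, reusing the structures from the build sketch above (\(\epsilon\) and the helper name are mine):

```python
def rbc_range_query(q, eps, X, rep_idx, lists, radii):
    """Range query sketch: skip any representative whose whole ball lies
    farther than eps from the query."""
    reps = X[rep_idx]
    d_rep = np.linalg.norm(reps - q, axis=1)
    hits = []
    for j in range(len(rep_idx)):
        if d_rep[j] > eps + radii[j]:          # no owned point can be within eps
            continue
        for i in lists[j]:
            if np.linalg.norm(X[i] - q) <= eps:
                hits.append(i)
    return hits
```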

The one-shot version of RBC is very similar, and has very tight probabilistic bounds on selecting the correct point. Each representative grabs the \(s\) points closest to it, where \(s\) is a new parameter; setting \(s = \sqrt{n}\) gives a very high probability of selecting the correct point. We then check \(r_q\) and all \(x \in L_{r_q}\) as potential nearest neighbors. No other points are considered, which avoids a lot of extra work.
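
A sketch of the one-shot version, under the same assumptions as the code above (again my own code, with overlapping lists allowed):

```python
def build_one_shot(X, s, seed=0):
    """One-shot build sketch: every representative keeps the s points closest
    to it, so lists can overlap and some points may end up unrepresented."""
    rng = np.random.default_rng(seed)
    n = len(X)
    m = int(np.sqrt(n))
    rep_idx = rng.choice(n, size=m, replace=False)
    d = np.linalg.norm(X[:, None, :] - X[rep_idx][None, :, :], axis=2)   # (n, m)
    lists = [np.argsort(d[:, j])[:s] for j in range(m)]
    return rep_idx, lists

def query_one_shot(q, X, rep_idx, lists):
    """Check only the nearest representative and the s points it grabbed."""
    reps = X[rep_idx]
    d_rep = np.linalg.norm(reps - q, axis=1)
    rq = d_rep.argmin()
    cand = np.append(lists[rq], rep_idx[rq])          # L_{r_q} plus r_q itself
    d_cand = np.linalg.norm(X[cand] - q, axis=1)
    best = d_cand.argmin()
    return cand[best], d_cand[best]
```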

The downside to the one-shot method is that it makes a poor approximation for range queries. Consider a query that sits near the edge of its representative's ball, with a range that extends into the ball of a second representative: it seems likely we will miss points on the other side. This becomes more of a problem when you pick \(s \ll \sqrt{n}\). While there is still a good probability of getting the correct nearest neighbor, it seems much harder to get correct range query results. Further, because each representative grabs its \(s\) closest points, the same point can appear in multiple representatives' lists, which means some points will go unrepresented entirely.

I think the less impressive behavior of the one-shot version is why the range query isn't even mentioned. But then again, many papers leave range queries as little more than a footnote. So far my tests haven't shown things to be as bad as my intuition suggested, but I'm still not quite comfortable using the one-shot version for such scenarios.