## Sunday, August 26, 2012

### Visualizing NN and KDE

So I've talked about Nearest Neighbor and Kernel Density Estimators, but its often easier to understand things we can visualize. Well, I've been working on a tool for that recently - so I loaded up a data set from here, as its an excellent data set for demonstrative purposes. Its multi-modal, has a non-trivial decision boundary, and has no perfect separation in any higher dimension.

So here is the original data set
 Original Data Set

First up,  1-NN.
 1-NN Decision Boundary

We can clearly see that over fitting has occurred. There are lots of pockets of red and blue where there really shouldn't be because of noisy neighbors. A problem that always occurs if the two classes overlap.

Turns out 15-NN is much more reasonable.
 15-NN Decision Boundary
The boundary is a little rough at times, but over all - its reflects the data very well. The cluster of the red class in the blue region is well formed and doesn't extend out too far. Good stuff.

However, we can also under-fit what if we took 150-NN?
 150-NN Decision Boundary
While it looks cool, its not a great decision boundary. While the boundary makes a smooth transition, the red cluster in the top has taken over all the top - there just isn't enough blue points left to counter the very dense red cluster.

How does this compare to KDE then? Remember that NN will expand until it finds the points it needs, where as KDE fixes the volume - so it is possible that no data points will be found!

Here is the MetricKDE with the Epanechnikov kernel.

Pretty good stuff. We can see there is a little noise at the very edges of the world when we have very few points left, but other then that - the decision boundary is very good. We can compare this result with the 15-NN one. There are both pros and cons to being able to provide a best guess in uncharted territory (15-NN), or being able to return a "I really have no idea" as you can with the KDE.

I also said last time that the Kernel used doesn't have a huge impact. Was I lying? Well, here is using the same boundary using the Uniform kernel.
 Uniform KDE
For the most part, its really the same. The transitions are not as smooth in the Uniform one. And we can see in the areas with outliers the difference in smoothness between the uniform kernel and Epanechnikov's.

 Gauss KDE
Oh, what happened to the "I really have no idea" space? Well, the Gaussian kernel has infinite support, meaning it always returns a finite non zero value. So some points are always going to show up (without the trick mentioned in the prior post, every point would show up!).  As such, we get every area colored, just like with the NN version.

But the Gaussian kernel has a few issues. It took several times longer to render then the others, the Gaussian kernel is more expensive to compute, and needs to be computed more often (because of the larger support). You may also notice the red dense region in the top is a bit wider then it should be. I mentioned last time that the heuristic used by MetricKDE to determine the right bandwidth is made for distributions with a support in the range of [-1,1]. Only about 68% of the points will fall in that range for the Gaussian, so that's what is causing that extra wideness there. Though you could implement a scaled version.

I'll look for some good data sets to demonstrate the differences (or lack there of) in using different distance metrics for a future post.