So, here is the original data set:
[Figure: Original Data Set]
First up, 1-NN.
[Figure: 1-NN Decision Boundary]
We can clearly see that overfitting has occurred. There are lots of pockets of red and blue where there really shouldn't be, caused by noisy neighbors. This is a problem that always occurs when the two classes overlap.
Turns out 15-NN is much more reasonable.
[Figure: 15-NN Decision Boundary]
The boundary is a little rough at times, but overall it reflects the data very well. The cluster of the red class in the blue region is well formed and doesn't extend out too far. Good stuff.
However, we can also under-fit. What if we took 150-NN?
[Figure: 150-NN Decision Boundary]
While it looks cool, it's not a great decision boundary. The boundary makes a smooth transition, but the red cluster at the top has taken over the whole upper region - there just aren't enough blue points left to counter the very dense red cluster.
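If you want to see just how much work k is doing, here is a toy sketch of the core vote (plain Java, written from scratch for illustration - this is not JSAT's implementation, and all the names are made up):

```java
import java.util.Arrays;
import java.util.Comparator;

// Toy k-NN majority vote - just to make the role of k concrete.
public class ToyKNN {
    // Returns the majority class among the k training points nearest to query.
    static int classify(double[][] trainX, int[] trainY, double[] query, int k, int numClasses) {
        Integer[] idx = new Integer[trainX.length];
        for (int i = 0; i < idx.length; i++)
            idx[i] = i;
        // Sort training points by (squared) distance to the query point
        Arrays.sort(idx, Comparator.comparingDouble((Integer i) -> sqDist(trainX[i], query)));

        int[] votes = new int[numClasses];
        for (int i = 0; i < k; i++) // only the k nearest get a vote
            votes[trainY[idx[i]]]++;

        int best = 0;
        for (int c = 1; c < numClasses; c++)
            if (votes[c] > votes[best]) best = c;
        return best;
        // k = 1: a single (possibly noisy) point decides -> jagged pockets.
        // k = 150: the vote is dominated by whichever class is denser overall -> over-smoothed.
    }

    static double sqDist(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++)
            s += (a[i] - b[i]) * (a[i] - b[i]);
        return s;
    }
}
```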
How does this compare to KDE then? Remember that NN will expand its search until it finds the points it needs, whereas KDE fixes the volume - so it is possible that no data points will be found at all!
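To make that concrete, here is a minimal 1-D sketch of a fixed-bandwidth KDE (again, not JSAT's MetricKDE - just an illustration, with the bandwidth h assumed given):

```java
// Minimal fixed-bandwidth KDE sketch (1-D for brevity) - not JSAT's
// MetricKDE, just an illustration of the fixed-volume behavior.
public class ToyKDE {
    // Epanechnikov kernel: K(u) = 0.75*(1 - u^2) for |u| <= 1, else 0.
    static double epanechnikov(double u) {
        return Math.abs(u) <= 1 ? 0.75 * (1 - u * u) : 0;
    }

    // density(x) = 1/(n*h) * sum_i K((x - x_i) / h)
    static double density(double[] data, double x, double h) {
        double sum = 0;
        for (double xi : data)
            sum += epanechnikov((x - xi) / h); // 0 for any point farther than h
        return sum / (data.length * h);
        // Unlike NN, the volume never grows to find points: if nothing lies
        // within h of x, the estimate is exactly 0 for every class - the
        // "I really have no idea" regions in the plots.
    }
}
```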
Here is the MetricKDE with the Epanechnikov kernel.
I also said last time that the kernel used doesn't have a huge impact. Was I lying? Well, here is the same boundary using the Uniform kernel.
[Figure: Uniform KDE]
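For reference, swapping kernels in the sketch above just changes the weighting function. The uniform kernel has the same [-1,1] support as Epanechnikov, which is part of why the boundary looks so similar:

```java
// Uniform (boxcar) kernel: every point within the bandwidth counts with
// equal weight. Same [-1,1] support as Epanechnikov, hence the similar boundary.
static double uniform(double u) {
    return Math.abs(u) <= 1 ? 0.5 : 0;
}
```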
What about the Gaussian?
[Figure: Gauss KDE]
Oh, what happened to the "I really have no idea" space? Well, the Gaussian kernel has infinite support, meaning it always returns a finite non-zero value. So some points are always going to show up (without the trick mentioned in the prior post, every point would show up!). As such, we get every area colored, just like with the NN version.
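In the same sketch notation, you can see why: the Gaussian kernel is strictly positive everywhere, no matter how far away a point is:

```java
// Gaussian kernel: strictly positive for every u, however large, so the
// density sketch above is never exactly 0 and every region gets classified.
static double gaussian(double u) {
    return Math.exp(-0.5 * u * u) / Math.sqrt(2 * Math.PI);
}
```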
But the Gaussian kernel has a few issues. It took several times longer to render than the others: the Gaussian kernel is more expensive to compute, and needs to be computed more often (because of its larger support). You may also notice the dense red region at the top is a bit wider than it should be. I mentioned last time that the heuristic used by MetricKDE to determine the right bandwidth is made for distributions with support in the range [-1,1]. Only about 68% of the Gaussian's mass falls in that range, so that's what is causing the extra wideness there. Though you could implement a scaled version.
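One way to do that scaling (my own sketch, not something in JSAT) is to shrink the Gaussian to a standard deviation of 1/3, so about 99.7% of its mass lands in [-1,1], which is what the bandwidth heuristic is expecting:

```java
// "Scaled" Gaussian with standard deviation 1/3: about 99.7% of its mass
// falls in [-1,1], roughly matching the heuristic's support assumption.
// Still a valid density - it's just the pdf of N(0, 1/9).
static double scaledGaussian(double u) {
    return 3 * Math.exp(-0.5 * (3 * u) * (3 * u)) / Math.sqrt(2 * Math.PI);
}
```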
I'll look for some good data sets to demonstrate the differences (or lack thereof) in using different distance metrics for a future post.
Hi, nice article! Just curious, how did you make these plots (e.g. what tools)?
Hi Thomas, and thanks!
The plots were made with a tool I'm building on top of JSAT. I'm a bit of a masochist, and have built GUI components from scratch in JSAT. You can find them under jsat.graphing.*
Specifically, jsat.graphing.ClassificationPlot is used for these. You could modify my example here to do classification, and play around with it yourself.
Thanks, will have a closer look.