Sunday, August 26, 2012

Visualizing NN and KDE

So I've talked about Nearest Neighbor and Kernel Density Estimators, but it's often easier to understand things we can visualize. Well, I've been working on a tool for that recently - so I loaded up a data set from here, as it's an excellent data set for demonstrative purposes. It's multi-modal, has a non-trivial decision boundary, and has no perfect separation in any higher dimension.

So here is the original data set:
Original Data Set


First up, 1-NN.
1-NN Decision Boundary


We can clearly see that overfitting has occurred. There are lots of pockets of red and blue where there really shouldn't be, caused by noisy neighbors - a problem that always occurs when the two classes overlap.

Turns out 15-NN is much more reasonable.
15-NN Decision Boundary
The boundary is a little rough at times, but overall it reflects the data very well. The cluster of the red class in the blue region is well formed and doesn't extend out too far. Good stuff.

However, we can also under-fit. What if we took 150-NN?
150-NN Decision Boundary
While it looks cool, it's not a great decision boundary. The boundary makes a smooth transition, but the red cluster has taken over the whole top - there just aren't enough blue points left to counter the very dense red cluster.

How does this compare to KDE then? Remember that NN will expand until it finds the points it needs, whereas KDE fixes the volume - so it is possible that no data points will be found!

Here is the MetricKDE with the Epanechnikov kernel. 

Pretty good stuff. We can see there is a little noise at the very edges of the world where we have very few points left, but other than that the decision boundary is very good. We can compare this result with the 15-NN one. There are both pros and cons to providing a best guess in uncharted territory (15-NN) versus being able to return an "I really have no idea" as you can with the KDE.

I also said last time that the kernel used doesn't have a huge impact. Was I lying? Well, here is the same boundary using the Uniform kernel.
Uniform KDE
For the most part, it's really the same. The transitions are not as smooth in the Uniform one, and in the areas with outliers we can see the difference in smoothness between the Uniform kernel and the Epanechnikov.

What about the Gaussian? 
Gauss KDE
Oh, what happened to the "I really have no idea" space? Well, the Gaussian kernel has infinite support, meaning it always returns a finite non-zero value. So some points are always going to contribute (without the trick mentioned in the prior post, every point would contribute!). As such, we get every area colored, just like with the NN version.

But the Gaussian kernel has a few issues. It took several times longer to render than the others - the Gaussian kernel is more expensive to compute, and needs to be evaluated more often (because of the larger support). You may also notice the dense red region in the top is a bit wider than it should be. I mentioned last time that the heuristic used by MetricKDE to determine the right bandwidth is made for kernels with support in the range [-1,1]. Only about 68% of the Gaussian's mass falls in that range, so that is what is causing the extra wideness there. Though you could implement a scaled version.
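
To sketch what such a scaled version might look like: the little class below is my own illustration (not JSAT code) that rescales the standard Gaussian so that three standard deviations land inside [-1,1]. That keeps about 99.7% of the kernel's mass in the range the bandwidth heuristic assumes, while the kernel still integrates to 1.

// Hypothetical example, not part of JSAT: a Gaussian kernel rescaled so that
// roughly 99.7% of its mass lies in [-1, 1] (three standard deviations),
// matching the support the bandwidth heuristic expects.
public class ScaledGaussianKernel
{
    /** Standard Gaussian density N(0, 1). */
    private static double gauss(double x)
    {
        return Math.exp(-0.5 * x * x) / Math.sqrt(2 * Math.PI);
    }

    /** K_s(x) = 3 K(3x): squeezing the argument by 3 and multiplying by 3
        concentrates the kernel in [-1, 1] while it still integrates to 1. */
    public static double k(double x)
    {
        return 3 * gauss(3 * x);
    }
}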

I'll look for some good data sets to demonstrate the differences (or lack thereof) in using different distance metrics for a future post.

Wednesday, August 8, 2012

Probability Estimation: Kernel Density Estimators

So I've mentioned Nearest Neighbor classification and Distance Metrics; now I'm going to talk about Kernel Density Estimators (KDE), which are also called Parzen Windows. KDEs are related to both of these things, but first - the single variable case.

Suppose we have a vector of sample values \(\vec{x} = \{x_1, \cdots, x_n\}\) from some unknown distribution \(D\). How do we query the pdf of \(D\) if we cannot fit it to one of the distributions we know? KDEs provide a way to do this. The idea is fairly simple: we take a query point \(q\) and set its probability based on how many prior examples are nearby. In this case, nearby is based on some distance measure, so we might want to see how many points are within \(h\) units. This differs from kNN, where we didn't care how far away our neighbors were - we just wanted \(k\) of them. It turns out to be pretty simple, and the \(pdf\) of \(D\) can be estimated by

$$pdf(q) = \frac{1}{n \cdot h} \sum_{i=1}^{n} \underbrace{K \overbrace{\left(\frac{q-x_i}{h}\right)}^{\text{Distance}}}_{\text{Kernel Evaluation}}$$

where \(K\) is a kernel function, which is essentially just a symmetric probability density, meaning
$$K(x) = K(-x)$$ and $$\int_{-\infty}^{\infty}K(x)\,dx = 1$$

The kernel function controls how much weight each prior example adds to the pdf. Basically, we place a small, scaled-down distribution on top of each point we have seen. Our final distribution is then the sum of all these smaller ones: \(K\) controls the shape and \(h\) controls the scale.

The Gaussian distribution
$$K(x) = \frac{1}{\sqrt{2 \pi}} e^{-\frac{1}{2}x^2}$$
is a valid kernel function. I prefer to use kernels that only return a non-zero value in the [-1,1] range. The simplest kernel function, the 'Parzen window' (the uniform kernel), is one such kernel.
$$K(x) = \frac{1}{2}\ 1_{(|x|\leq1)}$$
And the default kernel in JSAT is the Epanechnikov kernel. For classification, this is my favorite kernel. It is theoretically optimal, provides different weights based on distance, and is cheap to evaluate.
$$K(x) = \frac{3}{4}(1-x^2)\ 1_{(|x|\leq1)}$$
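
To make the estimate above concrete, here is a small, self-contained Java sketch of the three kernels and the single-variable formula. This is my own illustration with invented names, not the JSAT implementation.

// Illustrative univariate KDE; not the JSAT implementation.
public class SimpleKDE1D
{
    /** Epanechnikov kernel: (3/4)(1 - x^2) for |x| <= 1, else 0. */
    static double epanechnikov(double x)
    {
        return Math.abs(x) <= 1 ? 0.75 * (1 - x * x) : 0;
    }

    /** Uniform "Parzen window" kernel: 1/2 for |x| <= 1, else 0. */
    static double uniform(double x)
    {
        return Math.abs(x) <= 1 ? 0.5 : 0;
    }

    /** Gaussian kernel: the standard normal density, non-zero everywhere. */
    static double gaussian(double x)
    {
        return Math.exp(-0.5 * x * x) / Math.sqrt(2 * Math.PI);
    }

    /** pdf(q) = 1/(n*h) * sum_i K((q - x_i) / h), using the Epanechnikov kernel. */
    static double pdf(double q, double[] samples, double h)
    {
        double sum = 0;
        for (double xi : samples)
            sum += epanechnikov((q - xi) / h);
        return sum / (samples.length * h);
    }
}

Swapping which kernel pdf calls is the only change needed to compare them.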

In turn, the distance is scaled by the bandwidth parameter \(h\), so that larger values mean a larger search. This will smooth out the distribution, but might lose detail (under-fitting). Too small a value, and the distribution will spike only at points we have seen before (over-fitting). There is a lot of work and theory behind selecting the best \(h\) for the univariate case, but I'll leave that to you to look up. I'm more interested in the multivariate case. It turns out there are a number of ways to generalize KDEs to multiple dimensions. My preferred way is to replace the distance in our first equation with a distance metric, which makes the transition very intuitive.

$$pdf(\vec{q}) = \frac{1}{n \cdot h^{|\vec{q}|}} \sum_{i=1}^{n} \underbrace{K \overbrace{\left(\frac{d(\vec{q},\vec{x}_i)}{h}\right)}^{\text{Distance}}}_{\text{Kernel Evaluation}}$$
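
As a rough sketch of this generalization (again my own illustration, not JSAT's MetricKDE), the distance metric becomes a plug-in point: the Euclidean metric below could be swapped for the Mahalanobis metric, or anything else, without touching the estimator itself.

// Illustrative multivariate KDE with a pluggable distance metric; not JSAT code.
import java.util.List;

public class SimpleMetricKDE
{
    interface DistanceMetric { double dist(double[] a, double[] b); }

    /** Plain Euclidean distance; a Mahalanobis metric could be dropped in instead. */
    static final DistanceMetric EUCLID = (a, b) -> {
        double s = 0;
        for (int i = 0; i < a.length; i++)
            s += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.sqrt(s);
    };

    /** Epanechnikov kernel, zero outside [-1, 1]. */
    static double k(double x)
    {
        return Math.abs(x) <= 1 ? 0.75 * (1 - x * x) : 0;
    }

    /** pdf(q) = 1/(n * h^d) * sum_i K( d(q, x_i) / h ).
        Sums over every point for clarity; since the kernel is zero past h,
        a range query could skip most of them (see the next paragraph). */
    static double pdf(double[] q, List<double[]> data, double h, DistanceMetric dm)
    {
        double sum = 0;
        for (double[] xi : data)
            sum += k(dm.dist(q, xi) / h);
        return sum / (data.size() * Math.pow(h, q.length));
    }
}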

First, this becomes much easier to implement. When the kernel is zero outside [-1,1], instead of summing across our whole data set we can do a range query on a VectorCollection for all neighbors within \(h\), which saves a lot of computation time. Note that this is not technically valid as a pdf, as it won't integrate to 1. But if you re-normalize when comparing against other KDEs, you get a valid probability again.

Second, treating every direction the same way can lead to worse performance. Most sources that cover KDEs spend a lot of time using matrices for the bandwidth instead of a single scalar, so that the estimate can adapt to better fit the data. Using this generalization, we can put all of that logic into the distance metric instead. For example, we could use the Mahalanobis metric from my last post to help it adapt better to the data.

Now we can estimate the probability distribution of an n-dimensional data set. We can also see how this relates back to k-NN, where we had the formula
$$ P(\vec{x}) = \frac{k}{n \cdot V} $$

Where:
\(k\) is the number of neighbors we are considering
\(n\) is the total number of data points in the set
\(V\) is the volume of the area we are considering.

In kNN we fixed the number of neighbors we would consider, and let the volume grow until we found as many neighbors as we needed. With KDEs, we fix the volume using the bandwidth, and let the number of neighbors vary. We can then perform classification by fitting a KDE to each class, and choosing the class with the highest probability for each query point.
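
As a hedged sketch of that classification scheme, building on the SimpleMetricKDE illustration above (all names are mine, not JSAT's): fit one density per class and pick the largest, falling back to "no idea" when every class reports zero.

// Illustrative KDE classifier: one density estimate per class, pick the max.
// Reuses SimpleMetricKDE from the earlier sketch; not JSAT code.
import java.util.List;

public class SimpleKDEClassifier
{
    private final List<List<double[]>> perClassData; // index = class label
    private final double h;                          // bandwidth

    public SimpleKDEClassifier(List<List<double[]>> perClassData, double h)
    {
        this.perClassData = perClassData;
        this.h = h;
    }

    /** Returns the class with the highest density at q, or -1 if every class
        gives zero density - the "I really have no idea" case. */
    public int classify(double[] q)
    {
        int best = -1;
        double bestPdf = 0;
        for (int c = 0; c < perClassData.size(); c++)
        {
            double p = SimpleMetricKDE.pdf(q, perClassData.get(c), h,
                                           SimpleMetricKDE.EUCLID);
            if (p > bestPdf)
            {
                bestPdf = p;
                best = c;
            }
        }
        return best;
    }
}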

In the end, KDEs and kNN often have very similar performance once their parameters are optimally tuned, and each can perform better with less work depending on the data at hand. Some advantages of KDE that I like are:

  1. Easy parameter estimation. For kNN, you just have to try many values of k in cross validation, which can be very slow. KDEs can get a very good estimate by simply taking the mean plus 3 standard deviations of the distance to each point's 3rd nearest neighbor (see the sketch after this list). Intuitively, if a query is on the edge of the distribution, this means you have a good probability of getting 3 neighbors within range, and a very good chance of getting at least one. This is useful to ensure that you get a non-zero probability for points that should get one.
  2. Smooth probability estimates. kNN returns probabilities as the fraction of neighbors in the same class, meaning increments of \(1/k\). When using probabilities as a cut-off, it's annoying to have to remember that 0.9 isn't a possible probability if you only take 7 neighbors.
  3. Easy anomaly detection. In kNN we expand the volume until we find our neighbors; in KDE the volume is fixed. That means there may be no neighbors nearby, so a probability of 0 gets returned. You can use this as a simple and intuitive form of anomaly detection during classification. (You may want to refer such unusual inputs to a human expert!)
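
Here is a rough sketch of the bandwidth heuristic from point 1 - an illustration of the idea rather than the exact JSAT code, using a brute-force neighbor search for clarity.

// Illustrative bandwidth heuristic: h = mean + 3 * standard deviation of each
// point's distance to its 3rd nearest neighbor. Brute force O(n^2) for clarity;
// not the JSAT implementation. Assumes at least 4 data points.
import java.util.Arrays;
import java.util.List;

public class BandwidthHeuristic
{
    static double euclid(double[] a, double[] b)
    {
        double s = 0;
        for (int i = 0; i < a.length; i++)
            s += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.sqrt(s);
    }

    /** Distance from data.get(i) to its k-th nearest other point. */
    static double kthNeighborDist(List<double[]> data, int i, int k)
    {
        double[] dists = new double[data.size() - 1];
        int idx = 0;
        for (int j = 0; j < data.size(); j++)
            if (j != i)
                dists[idx++] = euclid(data.get(i), data.get(j));
        Arrays.sort(dists);
        return dists[k - 1];
    }

    static double estimateBandwidth(List<double[]> data)
    {
        int n = data.size();
        double mean = 0, sqSum = 0;
        for (int i = 0; i < n; i++)
        {
            double d = kthNeighborDist(data, i, 3);
            mean += d;
            sqSum += d * d;
        }
        mean /= n;
        double stdDev = Math.sqrt(Math.max(0, sqSum / n - mean * mean));
        return mean + 3 * stdDev;
    }
}
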
The KDE I've described here is implemented in JSAT as MetricKDE, and is the preferred one. Another KDE implementation, ProductKDE, is also available. It does not require an input kernel; however, it requires a lot more data to return good decision boundaries and is considerably slower, so I do not recommend using it.