Wednesday, August 8, 2012

Probability Estimation: Kernel Density Estimators

So I've mentioned Nearest Neighbor classification and Distance Metrics, now I'm going to talk about Kernel Density Estimators (KDE), which are also called Parzen Windows. KDE's are related to both of these things, but first - the single variable case.

Suppose we have a vector of sample values \(\vec{x} = \{x_1, \cdots, x_n\}\) from some unknown distribution \(D\). How do we query the pdf of \(D\) if we can not fit it to one of the distribution we know? KDEs provide a way to do this. The idea is fairly simple, we take a query point \(q\), and set its probability based on how many prior examples are near by. In this case, near by is based on some distance measure. So we might want to see how many points are \(h\) units away. This differs from kNN, where we didn't care how far away our neighbors were, we just wanted \(k\) of them. Turns out to be pretty simple, and the \(pdf\) of \(D\) can be estimated by

$$pdf(q) = \frac{1}{n \cdot h} \sum_{i=1}^{n} \underbrace{K \overbrace{\left(\frac{q-x_i}{h}\right)}^{Distance}}_{Kernel Evaluation}$$

Where \(K\) is a Kernel function that is essentially just a symmetric probability distribution, meaning
$$K(x) = K(-x)$$ and $$\int_{-\infty}^{\infty}K(x) = 1$$

The kernel function is controlling how much weight each prior example adds to the pdf. Basically, we place a small, scaled down distribution on top of each point we have seen. Our final distribution is then the sum of all these smaller ones. \(K\) controls the distribution and \(h\) controls the scale.

The Gaussian distribution
$$K(x) = \frac{1}{\sqrt{2 \pi}} e^{-\frac{1}{2}x^2}$$
is a valid kernel function. I prefer to use kernels that only return a non zero value in the [-1,1] range. The simplest kernel function, is the 'parzen window', is one such kernel.
$$K(x) = \frac{1}{2}\ 1_{(|x|\leq1)}$$
And the default kernel in JSAT is the Epanechnikov kernel. For classification, this is my favorite kernel. It is theoretically optimal, provides different weights based on distance, and is cheap to evaluate.
$$K(x) = \frac{3}{4}(1-x^2)\ 1_{(|x|\leq1)}$$

In turn, the distance is scaled by the bandwidth parameter \(h\), so that larger values mean a larger search. This will smooth out the distribution, but might lose detail (under fitting). Too small a value, and the distribution will spike only at points we have seen before (over fitting). There is a lot of work and theory behind selecting the best \(h\) for the single variate case, but I'll leave that to you to look up. I'm more interested in the multivariate case. It turns out there are a number of ways to generalize KDEs to multiple dimensions. My preferred way is to replaced the distance in our first equation with a distance metric, which makes the transition very intuitive.

$$pdf(\vec{q}) = \frac{1}{n \cdot h^{|\vec{q}|}} \sum_{i=1}^{n} \underbrace{K \overbrace{\left(\frac{d(\vec{q},\vec{x}_i)}{h}\right)}^{Distance}}_{Kernel Evaluation}$$

First, this becomes much easier to implement. Instead of summing across our whole data set, we can do a range query on a VectorCollection for all neighbors within \(h\). This saves a lot of computation time. Note, this is not technically valid as a pdf, as it wont integrate to 1. But if you re-normalize when comparing against other KDEs, you get a valid probability again.

Second, is that taking the distance in any direction can lead to worse performance. Most sources that go over KDEs spend a lot of time using matrices for the bandwidth instead of a single scalar, so that they can adapt to better fit the data. Using this generalization, we can put all of that logic into the distance metric instead. For example, we could use the Mahalanobis metric from my last post to help it adapt better to the data.

Now we can estimate the probability distribution of an n-dimensional data set. We can also see how this relates back to k-NN, where we had the forumla
$$ P(\vec{x}) = \frac{k}{n \cdot V} $$

\(k\) is the number of neighbors we are considering
\(n\) is the total number of data points in the set
\(V\) is the volume of the area we are considering.

In kNN we fixed the number of neighbors we would consider, and let the volume grow until we found as many neighbors as we needed. With KDEs, we fix the volume using the bandwidth, and let the number of neighbors vary. We can then perform classification by fitting a KDE to each class, and choosing the class with the highest probability for each query point.

In the end, KDEs and kNN often have very similar performance once their parameters are optimally tuned, and each can perform better with less work depending on the data at hand. Some advantages of KDE that I like are:

  1. Easy parameter estimation. For kNN, you just have to try many values of k in cross validation, which can be very slow. KDEs can get a very good estimate by simply taking the mean + 3 standard deviations of the 3rd nearest neighbor. Intuitively, if a query is on the edge of the distribution, this means you have a good probability of getting 3 neighbors in the range, and a very good chance of getting at least one. This is useful to ensure that you get a non-zero probability for points that should get one. 
  2. Smooth probability estimates. kNN return probabilities as the fraction of neighbors in the same class, meaning increments sof \(1/k\). When using probabilities as a cut off, its annoying to have to remember that 0.9 isn't a possible probability if you only take 7 neighbors.
  3. Easy anomaly detection. In kNN we expand the volume until we find our neighbors, in KDE the volume is fixed. That means there may be no neighbors nearby, so a probability of 0 gets returned. You can use this as a simple and intuitive form of anomaly detection during classification. (You may want to refer such unusual inputs to a human expert!) 
The KDE I've described here is implemented in JSAT as the MetricKDE, and is the preferred one. Another KDE implementation, ProductKDE, is also available. It does not require an input kernel, however - it requires a lot more data to return good decision boundaries, and is considerably slower. So I do not recommend using it. 

1 comment:

  1. Hi Edward,

    I'm, so glad to have found your blog. You are the first one being able to explain ML without writing dozens of equations in the first paragraph. Thanks for that.

    Now I'm getting close to the solution of my problem. I want to determine the value of comic books based on some characteristic (say number of prints, year of publication, physical condition). As far as I understand I could use KDE as follows:
    - create tuples (char1, char2, char3, value) of my training set
    - train algorithm, i.e. MetricKDE
    - To get the value of a new comic having char1, char2, char3 I would have to estimate the value several time, until I find the tuple with the highest pdf.

    Is that correct? Does there already exist an algorithm finding the value with highest pdf?