Saturday, November 24, 2012

Bias, Variance & Ensembles: Some Examples

In my last post, I talked a bit about bias, variance, and their importance in Ensemble methods. In this post I'll provide some visual examples that will hopefully better illustrate the concept. To do this, I'll be using Decision Stumps and Decision Trees.

A Decision Stump (or Stump) is just a Decision Tree with only one level. That means a stump can only select one attribute to split on. To keep this an apples-to-apples comparison, the only setting that will change between the two Stumps or Trees being compared is how splitting is done. The gain method, depth, pruning method, and everything else will be the same.
  1. C4.5 style splitting. Numeric attributes will be split by creating only a single binary divide that maximizes the gain. 
  2. My PDF splitting method. Numeric attributes can split an arbitrary number of times. Splitting is done by approximating the PDF of the attribute for each class using a KDE, and splitting at the intersections of the PDFs. All splits for the same class map to one path, so multiple value ranges map to a single path. Thus, the splits map every data point to the class that had the maximum probability in a given range when considering only one attribute. (Mine is a bit more complicated!)
The first method (C45) has a higher bias, since it can only make a single split along the axis, and a lower variance. 
The second method (MyPDF) has a lower bias, since it can adapt to the fluctuations in the PDF of the attribute in question. This increases the variance, since if the PDFs were truly equal, random fluctuations in their estimates may still cause intersections that make us split for no good reason. 
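
To make the first method a bit more concrete, here is a minimal sketch of how a C4.5 style stump picks its single numeric threshold: sort by the attribute and try each boundary between distinct values, keeping the one that maximizes the information gain. This is my own illustrative code, not JSAT's actual implementation, and the names (StumpSplit, findBestThreshold) are made up for the example.

import java.util.Arrays;

// A minimal sketch of a C4.5 style numeric split for a stump: one attribute, one binary threshold.
public class StumpSplit {

    static double entropy(int[] classCounts, int total) {
        double h = 0;
        for (int c : classCounts)
            if (c > 0) {
                double p = (double) c / total;
                h -= p * Math.log(p) / Math.log(2);
            }
        return h;
    }

    /** Returns the threshold on the given attribute values that maximizes information gain. */
    static double findBestThreshold(double[] values, int[] labels, int numClasses) {
        Integer[] order = new Integer[values.length];
        for (int i = 0; i < order.length; i++) order[i] = i;
        Arrays.sort(order, (a, b) -> Double.compare(values[a], values[b]));

        int[] left = new int[numClasses];              // class counts below the candidate threshold
        int[] right = new int[numClasses];             // class counts above it
        for (int y : labels) right[y]++;

        int n = values.length;
        double baseEntropy = entropy(right, n);
        double bestGain = -1, bestThreshold = Double.NaN;

        for (int i = 0; i < n - 1; i++) {              // move points from right to left, one at a time
            int idx = order[i];
            left[labels[idx]]++;
            right[labels[idx]]--;
            if (values[order[i]] == values[order[i + 1]])
                continue;                              // can only split between distinct values
            int nl = i + 1, nr = n - nl;
            double gain = baseEntropy
                    - (nl / (double) n) * entropy(left, nl)
                    - (nr / (double) n) * entropy(right, nr);
            if (gain > bestGain) {
                bestGain = gain;
                bestThreshold = (values[order[i]] + values[order[i + 1]]) / 2;
            }
        }
        return bestThreshold;
    }
}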

To start with, I'll give a simple example data set to illustrate the difference between these two methods using Stumps. 

Here are the results for a 4 class problem, 1 class in each corner - generated from Gaussians. (Make sure to click on the images to get a better view!)

Stumps C45

Stumps MyPDF

Clearly, the C45 produces a cleaner looking cut, but they would both actually have the same error rate in practice. So how does it look when we build a full tree?

Tree C45
Tree MyPDF

Again, the C45 looks a lot better. But if you look at the MyPDF version, you can see it has adapted to have zero training errors. But this is a very simple problem that is well suited to splitting along the axis. Let's now look at something more complicated.

In the next few examples we will look at the difference between the two when using Bagging. Bagging (and boosting at the end) is always done with 40 iterations.

This first data set is a series of encompassing circles. The outer ones are very close to each other, and the inner ones are farther apart. 



Tree C45
Tree MyPDF

Here we can see they have similar performance. The two inner circles look better under MyPDF, but it also has some noise jutting out between the two outer ones. Now let's see how they look after 40 rounds of bagging. 

Bagged Tree C45

Bagged Tree MyPDF

The difference is dramatic! The Bagged version of C45 improved a little bit, with slightly more rounded boundaries in the outer circles. However MyPDF now has very good and much smoother boundaries for the outer circles. It has less overlap, and the boundary stays about in the middle where it belongs. 

Another interesting example: this data set contains 3 classes growing out in a spiral. The complexity of the problem adds some more bias to the error term, but it's still a simple data set. First, the normal Trees. 

Tree C45
Tree MyPDF
The differences are much more pronounced using this data set. C45 is cleaner, but not very spiral like. MyPDF is more spiral like, but very blocky. Another 40 rounds of Bagging later:

Bagged Tree C45
Bagged Tree MyPDF

This one is more complicated. Both are still getting errors after bagging, but MyPDF has much more of a spiral shape to it. C45 is still very rigid. Because these artificial data sets have no noise in them, all the variance has to come out of the algorithm itself.



Lastly, let's look at the more complex data set from my previous post on KDE and k-NN. Here are the results for Trees on the same data set.

Tree C45
Tree MyPDF
Clearly, both could do better. There are lots of mistakes in each, but the general idea is there. Despite this, the boundaries are actually more similar than they were in previous problems. 

Bagged Tree C45
Bagged Tree MyPDF
In this more complex data set, we can see the differences between the two are shrinking when using only bagging. While the MyPDF version is probably a little bit better, especially since the dense red cluster in the top right is better fit, the results differ only slightly in 3 areas of the decision boundary. You can also look at the non-bagged version and see very clearly how the bagged boundary came to be. Because the distributions overlap and form a complex boundary, the bias term in the error is playing a much larger role than in the previous synthetic data sets. This has caused the performance of C45 and MyPDF trees to about balance out, and the variance reduction abilities of Bagging just aren't enough to get a really good decision boundary like before. 


So what about AdaBoost?

AdaBoost Tree C45
AdaBoost Tree MyPDF

AdaBoost, unlike bagging, does more than just reduce the variance. It also reduces bias by reweighting the data points based on previous errors. In this case, we can see that the classifier built with C45 is much cleaner than MyPDF. While the denser red band to the top right isn't perfect, MyPDF has clearly started to overfit and learn the noise in the model. MyPDF trees just weren't as good a weak learner for this task, because their bias was a bit too low. This allowed them to start overfitting after just a few iterations.

Just for fun, what if we use weak learners that are not strong enough? Below you can look at AdaBoost using Stumps.

AdaBoost Stump C45
AdaBoost Stump MyPDF
While the MyPDF results with AdaBoost are not great, C45 didn't seem to change at all! The boundary would look a little different for C45 if I stopped at just 4 or 8 iterations, but C45 stumps simply are not powerful enough to really learn better weights.

Because of the overlap, some of the blue points on the left get higher weights for being wrong, and the red on the right get higher weights. After a few rounds of this, when the weak learners get averaged out, the final decision boundary ends up almost exactly where a single stump would have placed it. The x axis never gets selected as the splitting feature, because a split there would grab the dense red strip: almost all the points below it would be wrong for red, and a number of blue points near the dense region would be wrong as well. So C45 stumps get stuck moving left and right.

Hopefully, this last example makes it clear that bias and variance really are a trade off when we try to perform classification. A great thing about decision trees is that you can restrict the maximum depth of the tree. By increasing the max depth, you decrease the bias and increase the variance. This gives us a very simple parameter to control the bias/variance trade off when using trees as a weak learner for Ensemble methods.

Also, in the real world, most problems have a combination of bias and variance in their error, unlike my synthetic data sets which are mostly variance. In practice C45 trees make excellent weak learners.

The one part of this equation we did not address was noise, which I hope to get to in a later post. 

Monday, November 19, 2012

Bias, Variance & Ensembles

A very popular method in Machine Learning is to combine multiple classifiers into one stronger classifier, which is called Ensemble learning. Simple enough, but how do we know it will do well? Do the classifiers we combine all have to be completely different, or can we use the same one over and over? In this post I'll introduce the basic concept and cover a bit of it that I think a lot of people don't fully understand.

This first requires a detour: what causes errors, and how do we measure them? The standard metric people tend to use is the squared error. So if \(f(x)\) is the true output of the function, \(\epsilon\) is the noise in the function's output, and \(h(x)\) is our approximation of the function, the expected error is then

$$Error = E[(f(x)+\epsilon - h(x))^2]$$

This can be decomposed into 3 parts.

$$Error = \underbrace{(f(x)-E[h(x)])^2}_{Bias} + \underbrace{E[(h(x)-E[h(x)])^2]}_{Variance} + \underbrace{E[\epsilon^2]}_{noise}$$

So what does this all tell us? The noise is just noise, there is not much we can do about it. If the noise dominates our error, there is nothing we can do from a learning perspective.

The bias is the squared difference between the true function and the expected output of our approximation, taken over many possible training sets. Essentially, it asks: what is the squared difference between \(f(x)\) and the best approximation we could possibly make?

The variance is the expected squared difference between a single approximation and that best (average) approximation.
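
For completeness, here is the short derivation of the decomposition. Because \(\epsilon\) has zero mean and is independent of \(h(x)\), and because \(f(x)\) and \(E[h(x)]\) are constants for a fixed \(x\), all the cross terms vanish:

$$E[(f(x)+\epsilon-h(x))^2] = E[(f(x)-h(x))^2] + E[\epsilon^2]$$
$$E[(f(x)-h(x))^2] = (f(x)-E[h(x)])^2 + E[(h(x)-E[h(x)])^2]$$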

This also tells us we have a trade off to make when creating an algorithm. Unless we switch to an entirely different algorithm (or use a meta algorithm like Ensemble methods), the only way to reduce the error is to adjust the bias and variance. However, it's not possible to reduce only one of them. Reducing the bias tends to increase the variance, and reducing the variance tends to increase the bias.

To help emphasize this, consider two extreme learners \(h\).

  1. The learner always outputs the same class, no matter what the input is
  2. The learner creates a random mapping from every possible input to one randomly chosen, valid output. 
The bias for \(h_1\) in this case would be very large. If \(h_1\) returned the majority class no matter what, the bias is 1 minus the prior probability of that class. The variance is zero though! It doesn't matter what we give the learner to learn on, it's going to output the same thing! Clearly, unless the problem was very unbalanced, \(h_1\) would produce bad results and the error is entirely in the bias term. 

The bias for \(h_2\) would be zero though! One of the random mappings would be a perfect mapping, so there is no bias - it can learn the true output function. However, the variance is huge! The number of incorrect mappings will be enormous - and most of them will be completely wrong. Again we would expect a very large error, but this time the error is all contained within the variance term. 

So what does this have to do with Ensemble methods? Some of the most popular are Boosting methods, which take weak learners (learners that are not good on their own) and produce a final output that is very good, reducing both bias and variance. So which learners work best? Ones that are low bias, high variance. Why? I will demonstrate with an example. 

The precursor to boosting is a technique called Bagging. Bagging is a simple procedure, and works as follows. 

Learning:
for n iterations:
Sample with replacement from the data set so that your sample is the same size as the data set. 
Train a learner on the sample

Output:
Output the majority vote of the \(n\) learned classifiers

That is it, and it's very simple. It increases accuracy by reducing variance. How? By approximating \(  E[h(x)] \approx \hat{h}(x) \) empirically by combining multiple \(h(x)\). Each bagged sample is different from the original data set and from each other, so we get various learners that perform differently if \(h(x)\) has a larger variance. The bias term would not be changed by this, but the variance term would go from \((h(x)-E[h(x)])^2\) to \((\hat{h}(x)-E[h(x)])^2\). If our approximation is good, it should result in a smaller difference. 
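
Here is a minimal sketch of the whole procedure. The Learner interface and Bagger class are illustrative stand-ins I made up for this post, not JSAT's real classes.

import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.function.Supplier;

// A minimal sketch of bagging: train each learner on a bootstrap sample, vote at prediction time.
interface Learner {
    void train(List<double[]> xs, List<Integer> ys);
    int classify(double[] x);
}

class Bagger {
    private final List<Learner> ensemble = new ArrayList<>();
    private final Random rand = new Random();

    void train(List<double[]> xs, List<Integer> ys, int iterations, Supplier<Learner> factory) {
        int n = xs.size();
        for (int it = 0; it < iterations; it++) {
            List<double[]> sampleX = new ArrayList<>(n);
            List<Integer> sampleY = new ArrayList<>(n);
            for (int j = 0; j < n; j++) {          // sample n points with replacement
                int idx = rand.nextInt(n);
                sampleX.add(xs.get(idx));
                sampleY.add(ys.get(idx));
            }
            Learner h = factory.get();
            h.train(sampleX, sampleY);
            ensemble.add(h);
        }
    }

    int classify(double[] x, int numClasses) {     // majority vote of the ensemble
        int[] votes = new int[numClasses];
        for (Learner h : ensemble)
            votes[h.classify(x)]++;
        int best = 0;
        for (int c = 1; c < numClasses; c++)
            if (votes[c] > votes[best]) best = c;
        return best;
    }
}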

Ok, so I've argued that we can reduce the variance - but why should we use low bias / high variance weak learners? 

Consider Bagging with the Nearest Neighbor classifier (1-NN). 1-NN is a low bias, low variance classifier. Say we perform 100 bagging iterations. Each sample, since we sampled with replacement, can have duplicate data points in it. In the limit, we expect each sample to contain only about 63% of the unique points of the original data set, with the rest being duplicates. That means when we do 100 iterations of bagging, and we have some query \(q\), we would expect around 63 of the bagged learners to contain the true nearest neighbor to \(q\). Since we output the majority vote, and the majority is likely to have the same nearest neighbor, we will output the same result as regular Nearest Neighbor. 
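
The 63% figure comes from the chance that any particular point shows up at least once in a bootstrap sample of size \(n\):

$$1 - \left(1 - \frac{1}{n}\right)^n \approx 1 - \frac{1}{e} \approx 0.632$$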

Clearly, now we can see that if the variance isn't large enough, we get no improvement at all! So we want weak learners that have variance. Why low bias ones instead of high bias ones? Intuitively, it is better to start closer to the solution (low bias) than farther away. Practically, if the bias is too high, many Ensemble methods will fail to converge, resulting in an Ensemble of size 1 - which is just a single weak learner! There is also the issue that the more weak learners we need, the longer it will take to classify. So if we can use fewer weak learners and get the same result, that is generally better. 

Lastly, we also now have a simple method for trying to build a better algorithm. Take an algorithm A. Now adjust it so we have A', which reduces its bias and increases its variance. Use bagging to reduce the variance back down. Now, hopefully, both the bias and variance are lower than when we started. 

Tuesday, October 30, 2012

Motivation for the SVM via the Perceptron

Many algorithms in Machine Learning have relationships with each other. One algorithm in particular, the Perceptron, is related to a lot of them. You can show how it evolves into numerous different algorithms. It's also very closely related to the Support Vector Machine.

What makes SVMs really useful for low dimension problems is that they can solve the problem in a higher dimensional space without explicitly forming that space, allowing them to work in infinite dimensional spaces! People often find this confusing, and if they aren't on the same page about what that means mathematically, don't see what it gains us. So I've taken a simple 2D data set and run the Perceptron algorithm on it 8 times. Starting with a polynomial of degree 1, I've explicitly formed the higher dimensional space and computed the results in that space. You can watch as the Perceptron slowly gains the ability to model more and more complex boundaries, until it gets to a point where it's able to get most of it right. Granted, the decision boundary isn't great - that's why we don't usually use the Perceptron, but hopefully it provides a bit of insight. The One-vs-All method was used since it is a multi class problem.
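
For the curious, here is roughly what "explicitly forming the space" means in code: a sketch (with made-up names, and only the binary case for brevity) that expands each 2D point into all monomials up to the chosen degree and runs an ordinary Perceptron on the expanded features. One-vs-All just wraps several of these.

import java.util.ArrayList;
import java.util.List;

// A sketch of explicit polynomial feature expansion plus a plain binary Perceptron.
public class PolyPerceptron {

    /** All monomials x1^a * x2^b with a + b <= degree; a = b = 0 gives the constant bias term. */
    static double[] expand(double x1, double x2, int degree) {
        List<Double> feats = new ArrayList<>();
        for (int a = 0; a <= degree; a++)
            for (int b = 0; a + b <= degree; b++)
                feats.add(Math.pow(x1, a) * Math.pow(x2, b));
        double[] out = new double[feats.size()];
        for (int i = 0; i < out.length; i++) out[i] = feats.get(i);
        return out;
    }

    /** Standard perceptron updates on the expanded features; labels are +1 / -1. */
    static double[] train(double[][] xs, int[] labels, int degree, int epochs) {
        double[] w = new double[expand(0, 0, degree).length];
        for (int e = 0; e < epochs; e++)
            for (int i = 0; i < xs.length; i++) {
                double[] phi = expand(xs[i][0], xs[i][1], degree);
                double score = 0;
                for (int j = 0; j < w.length; j++) score += w[j] * phi[j];
                if (labels[i] * score <= 0)            // mistake: nudge w toward the correct side
                    for (int j = 0; j < w.length; j++)
                        w[j] += labels[i] * phi[j];
            }
        return w;
    }
}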

Degree 1 Polynomial

Degree 2 Polynomial

Degree 3 Polynomial

Degree 4 Polynomial

Degree 5 Polynomial

Degree 6 Polynomial

Degree 7 Polynomial


Degree 8 Polynomial
So why is the SVM so great if the Perceptron can do this as well? Part of the answer is, again, that the SVM can do this without explicitly forming the space like I did for the Perceptron. However, the Perceptron can be modified to work in the same way as well. The real advantage is that the SVM finds a boundary that keeps it halfway between both of the classes. You can see the Perceptron having that problem above, where it often overlaps onto another class's space.

Friday, October 12, 2012

Random Ball Cover

I've been reading a recent paper that describes a new Vector Collection, called Random Ball Cover. It's a bit clever, and aims to answer nearest neighbor queries in \(O(\sqrt{n})\) time, which is unusual. While the paper goes about it in a very wordy manner, the purpose is that the basic step of the RBC algorithm is the naive NN algorithm, which is trivial to implement in a very efficient manner.

I'm currently implementing it with some modifications, and I'll be discussing them below. Though I will provide some summary, you'll get a better understanding of what I'm doing by reading the paper first.

RBC is really two different algorithms, an exact version, and an approximate "one-shot" version. The exact version is simple. Take \(\sqrt{n}\) samples from the set of vectors and call them representatives, and then assign each other vector to its nearest representative. Each representative will be expected to have \(O(\sqrt{n})\) points it represents.

For a query \(q\), we find the closest representative \(r_q\). If \(\psi_r\) is the farthest distance from a representative \(r\) to any of the points it represents, the following two inequalities can prune out the other representatives.
$$d(q, r) > d(q, r_q) + \psi_r$$
$$d(q, r) > 3 \cdot d(q, r_q)$$

If either of these is true, we can discard \(r\) and all the points it represents from consideration.

The original paper is obsessed with performing all operations in parallel. However, I don't believe a nearest neighbor based system should be used for tasks that need a timely answer. Instead, you can get the parallelism from parallel queries in a batch system. The storage requirements and slower responses from nearest neighbor systems are much better suited to this domain.

Dropping the need to do single queries in parallel, we can use the first equation to prune the individual points once we start looking at those owned by a representative. If \(L_r\) represents those owned by the representative \(r\), let \(\alpha\) be any point \( \in L_r\)

$$d(q, r) > d(q, r_q) + d(r, \alpha)$$

can be used to prune individual points. It is easy to see this is correct: simply imagine that all points in \(L_r\) farther from \(r\) than \(\alpha\) were removed from the set, so that \(d(r, \alpha)\) plays the role of \(\psi_r\). The bound will still hold.
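
Putting the three bounds together, an exact RBC query might look roughly like the sketch below. This is my own illustrative code, not the paper's or JSAT's; the Representative class and the Euclidean dist function are assumptions made for the example.

import java.util.List;

// A sketch of an exact RBC query using the pruning rules above.
class Representative {
    double[] center;
    List<double[]> owned;      // points this representative owns
    double[] ownedDist;        // d(center, owned[i]), precomputed at build time
    double radius;             // psi_r: the largest value in ownedDist
}

public class RBCSearch {
    static double dist(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.sqrt(s);
    }

    static double[] nearest(double[] q, List<Representative> reps) {
        // 1) find the closest representative r_q
        Representative rq = null;
        double dq = Double.POSITIVE_INFINITY;
        for (Representative r : reps) {
            double d = dist(q, r.center);
            if (d < dq) { dq = d; rq = r; }
        }
        double[] best = rq.center;
        double bestDist = dq;

        // 2) scan the representatives, pruning with the two representative-level bounds
        for (Representative r : reps) {
            double dqr = dist(q, r.center);
            if (dqr > dq + r.radius) continue;     // bound 1: none of r's points can beat r_q
            if (dqr > 3 * dq) continue;            // bound 2: r cannot own the nearest neighbor
            // 3) scan the owned points, pruning individuals with the point-level bound
            for (int i = 0; i < r.owned.size(); i++) {
                if (dqr > dq + r.ownedDist[i]) continue;
                double d = dist(q, r.owned.get(i));
                if (d < bestDist) { bestDist = d; best = r.owned.get(i); }
            }
        }
        return best;
    }
}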

The first bound can also be used to do efficient range queries, but the author neglects this.

The one-shot version of RBC is very similar, and has very tight probabilistic bounds on selecting the correct point. Each representative grabs the \(s\) points nearest to it, where \(s\) is a new parameter that we can set to \(\sqrt{n}\) to obtain a very high probability of selecting the correct point. We then check \(r_q\) and all \(x \in L_{r_q}\) as potential nearest neighbors. Nothing else is considered, which avoids a lot of extra work.

The downside to the one-shot method is that it makes a poor approximation for range queries. Consider a query that is near the border and has a range that extends into the radius of a second representative. It seems we will likely miss points that are on the other side. This becomes more of a problem when you pick \(s << \sqrt{n}\). While there is still a good probability of getting the correct nearest neighbor, it seems much harder to get the correct range query results. Further, because each representative gets its \(s\) closest, the same points can be in multiple representatives. This means some points will be unrepresented.

I think the less impressive behavior of the one-shot version is why the range query isn't even mentioned. But then again, many papers leave range queries as little more than a footnote. So far my tests haven't shown my intuition to be as bad as I thought, but I'm still not quite comfortable using one-shot for such scenarios.

Saturday, September 22, 2012

Vector Collections, KD & VP Tree timings

I've shown a bit of kNN and KDE, which both need to know how to query a metric space. While both of these algorithms are intrinsically dependent on this ability, many other algorithms in Machine Learning also make use of the ability to query a space of points and select the k-nearest or all the points within a certain range. So for this post, I'm going to go over a quick intro to the two most useful in JSAT.

JSAT encapsulates this functionality into a VectorCollection, so that the algorithm using this functionality can be independent of how this is done.

The naive way to do this, is of course, to linearly scan through the whole collection each time a request is made. The distance is computed for every point in the data set, and the points that matched are returned. If we have \(n\) points, this clearly takes \(O(n)\) time. There are also some unfortunate theoretical results about this. Given an arbitrary set of points in an arbitrary dimension, \(O(n)\) is the theoretical best we can do.

However, theory and practice rarely match up. Despite this result, we can get an improvement on many data sets. The most important factor in whether we beat this result is how high the dimension of our data set is. For problems of this nature, 10 is sadly a fairly high dimension - often too high for the first method I will talk about, KD-Trees.

KD-Tree stands for K-Dimensional tree, and is a very popular algorithm. It recursively splits the space along one of the axes. By selecting the split so that half of the points fall on one side of the plane, and the other half on the opposite side, it hopes to allow you to ignore whole portions of the space. If you cut out half at every step, you have to go down \(O(\log n)\) levels to reach the bottom, which will hopefully contain the nearest neighbor. Which axis to split on? Well, you can just iterate through the axes in order, or select the axis with the most variance.

The intuition is fairly simple, but unfortunately KD-trees only work well for small dimensions, usually less than 10. Part of the reason is that we split on the axes, which are fixed, and can not adjust to the data at hand. When the dimensions get large, we may not even have enough points to use all the axes, leaving a large amount of work left at the terminal nodes. When the recursion comes back up, the bounds are too loose to prune out the other branches, forcing it to explore them. Because of this extra work, KD-trees can be even slower than the naive algorithm. Selecting the attribute with the most variance also stops being helpful, as the variance is constrained to the axes, which means it can't capture the true variance, and will result in noisy selection of the attribute to split on.

The next method I like, which is not widely used, is the Vantage Point Tree (VP-tree). The logic is very similar, but the method is more general and also more expensive to build. At each level, a point from the data set is selected to be the Vantage Point, the idea being that it will have sight to all the other points at its node. The distance from each point to the VP is computed, and the distances are then sorted. The median value is selected as the splitting value, and two sub trees are formed. This median value is a radius, and we have formed a circle: inside the radius are the points closer to the VP, and outside are the points farther away. Again, we can see the halving gets us a tree with \(O(\log n)\) depth. We then use the triangle inequality to prune out the inner and outer sides, hopefully answering queries in \(O(\log n)\) time.
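
Here is a rough sketch of that construction, with the vantage point chosen randomly (the cheap option discussed below). The class and names are made up for illustration, not JSAT's actual implementation. Querying then descends into whichever side \(q\) falls on, and only visits the other side when the distance to the VP is within the current best distance of the radius - that pruning check is where the triangle inequality comes in.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Random;

// A sketch of building one VP-tree node: pick a vantage point, then split the remaining
// points at the median distance into an "inner" and an "outer" child.
public class VPNode {
    double[] vantagePoint;
    double radius;                 // median distance from the vantage point
    VPNode inner, outer;
    List<double[]> leafPoints;     // only set for leaf nodes

    static double dist(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.sqrt(s);
    }

    static VPNode build(List<double[]> points, Random rand) {
        VPNode node = new VPNode();
        if (points.size() <= 10) {                 // small enough, just store the points
            node.leafPoints = points;
            return node;
        }
        // pick the vantage point at random (the original paper samples candidates instead)
        node.vantagePoint = points.remove(rand.nextInt(points.size()));

        double[] dists = new double[points.size()];
        for (int i = 0; i < points.size(); i++)
            dists[i] = dist(node.vantagePoint, points.get(i));
        double[] sorted = dists.clone();
        Arrays.sort(sorted);
        node.radius = sorted[sorted.length / 2];   // the median distance becomes the splitting radius

        List<double[]> in = new ArrayList<>(), out = new ArrayList<>();
        for (int i = 0; i < points.size(); i++)
            (dists[i] < node.radius ? in : out).add(points.get(i));
        if (in.isEmpty() || out.isEmpty()) {       // degenerate split (e.g., many duplicate distances)
            points.add(node.vantagePoint);
            node.leafPoints = points;
            node.vantagePoint = null;
            return node;
        }
        node.inner = build(in, rand);
        node.outer = build(out, rand);
        return node;
    }
}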

Because we use the actual distances to build the tree, instead of the axes, we get a method that works much better in higher dimensional spaces. However, we pay a price in construction time. We have to do \(O(n \log n) \) distance computations, whereas KD-Trees, while also \(O(n \log n)\) to build, need no distance computations at all. There is also the problem of how to select the Vantage Point. The originally proposed method is to select one by sampling the eligible points, and keeping the candidate VP that produces a lower variance in the sample. You can also select the VP randomly, and save a lot of expensive computation.

Another option I added is how to make the split. Querying slows down when we are unable to successfully prune out branches, meaning our splits aren't adequately separating points that should be separated. Instead of selecting the median, we could instead select a radius that minimizes the total variance of each sub node. This gives up the theoretical justification for expecting results in \(O(\log n)\) time, but hopefully makes the separations better. It shows up in the graphs below as VP-Tree-MV.

So, how do they compare on some decently large dimension problems then? I've computed the average build and query times for selecting the 1-NN on four different data sets from the UCI repository. Build times were calculated from the training set of 9-fold cross validation, and query times from the testing fold. This way there were no distances of zero, and we used all the data to get a good result.

The graphs for these times are below. The title indicates the data set, and below it is (dimension) x (data set size).






We can see the Naive method has no build time, which makes sense. There is no work to do! We just have to have the data points. We can also see that VP-Trees are very slow to build with sampling. While we can see an improvement in query times, it's not by much. If we know we have a problem where we will be doing much more querying for a long time, it might be worth it. Otherwise, the cost is a bit too much. We can also see VP-Tree-MV often has a small advantage, at relatively no cost. Though its performance is likely to be more variable across data sets. 


So, are VP-Trees always better? Usually for large dimensions. If you have a smaller dimensional problem, KD-Trees can perform a query using fewer resources. VP-Trees also have the advantage that they work for any distance metric, where KD-Trees are only valid for the p-norm metrics.

VP-Trees are similar to another algorithm called Ball-Trees. However, I've never found Ball-Trees to query faster than VP-Trees, and they are even more expensive to build. They're also more complicated to implement, so I've only implemented VP-Trees in JSAT. Ball-Trees are very popular, but I've never understood why.

JSAT also has an implementation of R-Trees, however - they need specialized batch insertion methods to be efficient in practice. Because these methods are still slower than VP-Trees and Ball-Trees, and fairly complicated, I've never bothered to implement them.

Sunday, August 26, 2012

Visualizing NN and KDE

So I've talked about Nearest Neighbor and Kernel Density Estimators, but it's often easier to understand things we can visualize. Well, I've been working on a tool for that recently - so I loaded up a data set from here, as it's an excellent data set for demonstrative purposes. It's multi-modal, has a non-trivial decision boundary, and has no perfect separation in any higher dimension.

So here is the original data set
Original Data Set


First up,  1-NN.
1-NN Decision Boundary


We can clearly see that overfitting has occurred. There are lots of pockets of red and blue where there really shouldn't be, because of noisy neighbors. A problem that always occurs if the two classes overlap.

Turns out 15-NN is much more reasonable.
15-NN Decision Boundary
The boundary is a little rough at times, but overall it reflects the data very well. The cluster of the red class in the blue region is well formed and doesn't extend out too far. Good stuff. 

However, we can also under-fit. What if we took 150-NN?
150-NN Decision Boundary
While it looks cool, it's not a great decision boundary. While the boundary makes a smooth transition, the red cluster in the top has taken over the whole top - there just aren't enough blue points left to counter the very dense red cluster. 

How does this compare to KDE then? Remember that NN will expand until it finds the points it needs, whereas KDE fixes the volume - so it is possible that no data points will be found!

Here is the MetricKDE with the Epanechnikov kernel. 

Pretty good stuff. We can see there is a little noise at the very edges of the world where we have very few points left, but other than that the decision boundary is very good. We can compare this result with the 15-NN one. There are both pros and cons to being able to provide a best guess in uncharted territory (15-NN), versus being able to return an "I really have no idea" as you can with the KDE.

I also said last time that the Kernel used doesn't have a huge impact. Was I lying? Well, here is the same boundary computed using the Uniform kernel.
Uniform KDE
For the most part, it's really the same. The transitions are not as smooth in the Uniform one. And we can see, in the areas with outliers, the difference in smoothness between the Uniform kernel and Epanechnikov's. 

What about the Gaussian? 
Gauss KDE
Oh, what happened to the "I really have no idea" space? Well, the Gaussian kernel has infinite support, meaning it always returns a finite, non-zero value. So some points are always going to show up (without the trick mentioned in the prior post, every point would show up!). As such, we get every area colored, just like with the NN version. 

But the Gaussian kernel has a few issues. It took several times longer to render than the others, since the Gaussian kernel is more expensive to compute, and needs to be computed more often (because of the larger support). You may also notice the dense red region in the top is a bit wider than it should be. I mentioned last time that the heuristic used by MetricKDE to determine the right bandwidth is made for distributions with support in the range of [-1,1]. Only about 68% of the points will fall in that range for the Gaussian, so that's what is causing that extra wideness there. Though you could implement a scaled version. 

I'll look for some good data sets to demonstrate the differences (or lack thereof) in using different distance metrics for a future post.  

Wednesday, August 8, 2012

Probability Estimation: Kernel Density Estimators

So I've mentioned Nearest Neighbor classification and Distance Metrics, now I'm going to talk about Kernel Density Estimators (KDEs), which are also called Parzen Windows. KDEs are related to both of these things, but first - the single variable case.

Suppose we have a vector of sample values \(\vec{x} = \{x_1, \cdots, x_n\}\) from some unknown distribution \(D\). How do we query the pdf of \(D\) if we can not fit it to one of the distributions we know? KDEs provide a way to do this. The idea is fairly simple: we take a query point \(q\), and set its probability based on how many prior examples are nearby. In this case, nearby is based on some distance measure. So we might want to see how many points are within \(h\) units. This differs from kNN, where we didn't care how far away our neighbors were, we just wanted \(k\) of them. It turns out to be pretty simple, and the \(pdf\) of \(D\) can be estimated by

$$pdf(q) = \frac{1}{n \cdot h} \sum_{i=1}^{n} \underbrace{K \overbrace{\left(\frac{q-x_i}{h}\right)}^{Distance}}_{Kernel Evaluation}$$

Where \(K\) is a Kernel function that is essentially just a symmetric probability distribution, meaning
$$K(x) = K(-x)$$ and $$\int_{-\infty}^{\infty}K(x)\,dx = 1$$

The kernel function is controlling how much weight each prior example adds to the pdf. Basically, we place a small, scaled down distribution on top of each point we have seen. Our final distribution is then the sum of all these smaller ones. \(K\) controls the distribution and \(h\) controls the scale.

The Gaussian distribution
$$K(x) = \frac{1}{\sqrt{2 \pi}} e^{-\frac{1}{2}x^2}$$
is a valid kernel function. I prefer to use kernels that only return a non-zero value in the [-1,1] range. The simplest kernel function, the 'Parzen window', is one such kernel.
$$K(x) = \frac{1}{2}\ 1_{(|x|\leq1)}$$
And the default kernel in JSAT is the Epanechnikov kernel. For classification, this is my favorite kernel. It is theoretically optimal, provides different weights based on distance, and is cheap to evaluate.
$$K(x) = \frac{3}{4}(1-x^2)\ 1_{(|x|\leq1)}$$

In turn, the distance is scaled by the bandwidth parameter \(h\), so that larger values mean a larger search. This will smooth out the distribution, but might lose detail (under fitting). Too small a value, and the distribution will spike only at points we have seen before (over fitting). There is a lot of work and theory behind selecting the best \(h\) for the single variate case, but I'll leave that to you to look up. I'm more interested in the multivariate case. It turns out there are a number of ways to generalize KDEs to multiple dimensions. My preferred way is to replace the difference \(q - x_i\) in our first equation with a distance metric, which makes the transition very intuitive.

$$pdf(\vec{q}) = \frac{1}{n \cdot h^{|\vec{q}|}} \sum_{i=1}^{n} \underbrace{K \overbrace{\left(\frac{d(\vec{q},\vec{x}_i)}{h}\right)}^{Distance}}_{Kernel Evaluation}$$

First, this becomes much easier to implement. Instead of summing across our whole data set, we can do a range query on a VectorCollection for all neighbors within \(h\). This saves a lot of computation time. Note, this is not technically valid as a pdf, as it won't integrate to 1. But if you re-normalize when comparing against other KDEs, you get a valid probability again.

Second, taking the distance equally in every direction can lead to worse performance. Most sources that go over KDEs spend a lot of time using matrices for the bandwidth instead of a single scalar, so that they can adapt to better fit the data. Using this generalization, we can put all of that logic into the distance metric instead. For example, we could use the Mahalanobis metric from my last post to help it adapt better to the data.
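
To tie the pieces together, here is a minimal sketch of this metric-based KDE with the Epanechnikov kernel. It is illustrative code, not the actual MetricKDE class; a real implementation would replace the linear scan with a range query on a VectorCollection, as mentioned above.

import java.util.List;
import java.util.function.BiFunction;

// A sketch of the metric-based KDE above, using the Epanechnikov kernel.
public class MetricKdeSketch {

    static double epanechnikov(double u) {
        return Math.abs(u) <= 1 ? 0.75 * (1 - u * u) : 0;
    }

    /** Density estimate at q; renormalize across the per-class KDEs before comparing them. */
    static double pdf(double[] q, List<double[]> data,
                      BiFunction<double[], double[], Double> metric, double h) {
        double sum = 0;
        for (double[] x : data)
            sum += epanechnikov(metric.apply(q, x) / h);   // only points within h contribute
        return sum / (data.size() * Math.pow(h, q.length));
    }
}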

Now we can estimate the probability distribution of an n-dimensional data set. We can also see how this relates back to k-NN, where we had the formula
$$ P(\vec{x}) = \frac{k}{n \cdot V} $$

Where:
\(k\) is the number of neighbors we are considering
\(n\) is the total number of data points in the set
\(V\) is the volume of the area we are considering.

In kNN we fixed the number of neighbors we would consider, and let the volume grow until we found as many neighbors as we needed. With KDEs, we fix the volume using the bandwidth, and let the number of neighbors vary. We can then perform classification by fitting a KDE to each class, and choosing the class with the highest probability for each query point.

In the end, KDEs and kNN often have very similar performance once their parameters are optimally tuned, and each can perform better with less work depending on the data at hand. Some advantages of KDE that I like are:

  1. Easy parameter estimation. For kNN, you just have to try many values of k in cross validation, which can be very slow. KDEs can get a very good estimate by simply setting the bandwidth to the mean plus 3 standard deviations of the distances to each point's 3rd nearest neighbor. Intuitively, if a query is on the edge of the distribution, this means you have a good probability of getting 3 neighbors in the range, and a very good chance of getting at least one. This is useful to ensure that you get a non-zero probability for points that should get one. 
  2. Smooth probability estimates. kNN returns probabilities as the fraction of neighbors in the same class, meaning increments of \(1/k\). When using probabilities as a cut off, it's annoying to have to remember that 0.9 isn't a possible probability if you only take 7 neighbors.
  3. Easy anomaly detection. In kNN we expand the volume until we find our neighbors, in KDE the volume is fixed. That means there may be no neighbors nearby, so a probability of 0 gets returned. You can use this as a simple and intuitive form of anomaly detection during classification. (You may want to refer such unusual inputs to a human expert!) 
The KDE I've described here is implemented in JSAT as the MetricKDE, and is the preferred one. Another KDE implementation, ProductKDE, is also available. It does not require an input kernel, however - it requires a lot more data to return good decision boundaries, and is considerably slower. So I do not recommend using it. 

Monday, July 16, 2012

Distance Metrics

In my last post, I mentioned that we needed some distance metric in order to use the nearest neighbor algorithm. While you can define a distance metric for anything, I'll be restricting myself to just the JSAT definition of the distance between two vectors \(\vec{x}\) and \(\vec{y}\), which we denote as \(d(\vec{x},\vec{y})\).

First thing is first, a true distance metric has to have some specific properties. 
  1. It needs to satisfy the identity of indiscernibles, meaning \(d(\vec{x},\vec{y}) = 0 \)  if and only if \(\vec{x} = \vec{y}\)
  2. It needs to be symmetric, meaning \(d(\vec{x},\vec{y}) = d(\vec{y},\vec{x}) \) 
  3. It needs to be sub-additive, meaning \(d(\vec{x},\vec{z}) \leq d(\vec{x},\vec{y}) + d(\vec{y},\vec{z}) \) 
Many sources also list that \(d(\vec{x}, \vec{y}) \geq 0 \), but it really does not need to be stated. Having the first 3 conditions satisfied implies the non-negativity. Both the first and the 2nd properties are also pretty easy to deal with. You just have to make sure everything cancels out to get property one. If your metric isn't symmetric, you can easily make it so by defining 
\[\hat{d}(\vec{x},\vec{y}) = d(\vec{x},\vec{y}) + d(\vec{y},\vec{x})\]
It's the third property, often called the triangle inequality, that is really important. It's what allows JSAT to implement VectorCollections that can return the nearest neighbor in \(O(\log n)\) time for many datasets. But I'll discuss such tactics another day. 


What I really want to get to right now is how a distance metric can be helpful. Probably the most common metric is the L2 norm, also called the Euclidean Distance.
\[d(\vec{x}, \vec{y}) = \|\vec{x}-\vec{y}\| = \sqrt{\sum_{i = 1}^{n} (x_i - y_i)^2 } \]
which is a specific case of the p-Norm 
\[\left(\sum_{i = 1}^{n} |x_i - y_i|^p \right)^{\frac{1}{p}} \]
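
In code, the p-norm distance is just a few lines (a small sketch, with p = 2 recovering the Euclidean distance above):

public class Norms {
    /** p-norm distance between two vectors; p = 2 gives the Euclidean distance. */
    public static double pNormDistance(double[] x, double[] y, double p) {
        double sum = 0;
        for (int i = 0; i < x.length; i++)
            sum += Math.pow(Math.abs(x[i] - y[i]), p);
        return Math.pow(sum, 1.0 / p);
    }
}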

Now, something I did not discuss last time is that the nearest neighbor algorithm is sensitive to the scale of the variables. What does that mean?

Consider a data set where we want to estimate the maximum capacity of a building. Let us say its attributes are the number of floors and the square feet per floor. Two dimensions.

  • Data point 1: \(\vec{x} =  \{1000 \text{ floors}, 2 \text{ square feet}\}\)
  • Data point 2: \(\vec{y} =  \{500 \text{ floors}, 4 \text{ square feet}\}\)
Clearly we can see, \(1000 \cdot 2 = 2000\) total square feet for \(\vec{x}\), and \(500 \cdot 4 = 2000\)  square feet for \(\vec{y}\). Our intuition would tell us that they would hold a similar number of people, perhaps the need for larger elevators in one building causes the difference. Using the L2 norm, we get 
\[d(\vec{x}, \vec{y}) = \sqrt{(1000-500)^2 + (2-4)^2} = 2 \sqrt{62501} \approx 500 \]

Oh no batman! The number of floors is so much greater than the square footage that it's dominating our distance calculation! Logically we would expect these two points to be near each other, since they have the same total square footage, but our metric has placed them far apart! 

Now obviously, I made up this example to illustrate the problem. One solution is to create our own custom distance metric with expert knowledge. Not only is this time intensive, but if our custom metric does not obey property (3) above, we have to fall back to \(O(n)\) searches if we want to use nearest neighbor! 

The most common solution I see is that people choose to scale the whole data set. They make sure all values are in the range \([0,1]\) or \([-1,1]\). If you go this route, you also have to remember to scale the new points you want to classify. A minor inconvenience, but nothing horrible. 

A less traveled path, which I like to use, is to let my distance metric handle the problem for me. The Mahalanobis metric can do just that in many cases. If \(\Sigma\) is the covariance matrix for our data set, then the Mahalanobis distance is 
\[d(\vec{x},\vec{y}) = \sqrt{ (\vec{x}-\vec{y})^T \Sigma^{-1} (\vec{x}-\vec{y}) } \]
So what just happened here? If we set \(\Sigma\) equal to the identity matrix, we get 
\[\sqrt{ (\vec{x}-\vec{y})^T  I^{-1}  (\vec{x}-\vec{y}) } \]
\[\sqrt{ (\vec{x}-\vec{y})^T (\vec{x}-\vec{y}) } \]
This is taking the difference between each value, squaring them, and summing them together. Wait a minute - we've seen this before! After some arithmetic, we would see that it's equal to 
\[ \|\vec{x}-\vec{y}\| \]
So if the covariance is the identity (unit variances, no correlation), we are equivalent to the Euclidean distance! When variance is present, multiplying by \(\Sigma^{-1}\) will perform a scaling of the individual attributes based on their covariance. An additional benefit is that redundant variables will have less weight, which avoids over counting the contributions of redundant features. 
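
As a sketch, the computation looks like this. I pass in the precomputed inverse covariance matrix, since inverting \(\Sigma\) (or factoring it) is a one-time cost paid when the metric is trained; the class name is made up for the example.

// A sketch of the Mahalanobis distance, given a precomputed inverse covariance matrix.
public class Mahalanobis {
    public static double distance(double[] x, double[] y, double[][] sigmaInverse) {
        int n = x.length;
        double[] diff = new double[n];
        for (int i = 0; i < n; i++) diff[i] = x[i] - y[i];

        double sum = 0;                      // (x-y)^T * SigmaInverse * (x-y)
        for (int i = 0; i < n; i++) {
            double row = 0;
            for (int j = 0; j < n; j++)
                row += sigmaInverse[i][j] * diff[j];
            sum += diff[i] * row;
        }
        return Math.sqrt(sum);
    }
}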

This is a fair amount of extra work to perform during distance computations, but it is very convenient. Even if the data has already been scaled, using the Mahalanobis metric can still give additional gains in accuracy. Though it's not perfect, and works best when the data can be well approximated by a Gaussian distribution. When the data is really abnormal, it can actually hurt the performance of our algorithms. 

But hopefully you see that it's worth putting some thought into selecting your distance metric. In many cases, the L2 will do just fine. But when you're hunting down the last 3% of accuracy you need, or trying to deal with a poorly behaved data set, consider lifting some of the work off of the algorithm and into the metric. 

Saturday, July 7, 2012

Probability Estimation: Nearest Neighbor

In my last post, I talked about how estimating the probability of some event \(x \) is useful to us. But I neglected mentioning how. This task is what many ML algorithms are trying to do. The Nearest Neighbor algorithm is a very simple method that does this, and is really unreasonably accurate for many applications. 


In fact, the algorithm is easier to understand than the theory behind it. So we will start there. 


In words: if you have a training data set \(D\), composed of \(n\) data points \(\vec{x}_i\), each paired with a target value \(y_i\), then we can classify a new data point \(q\) by finding the data point in \(D\) that is closest to \(q\), and giving \(q\) the same class. 


To be a bit more explicit, we need a concept of distance. We denote the distance between two points \(a\) and \(b\) by \(d(a, b)\). Now, we can describe the algorithm a bit more explicitly. 

$$
i^* = \arg\min_{i} d(\vec{x}_i, \vec{q}), \quad \vec{x}_i \in D \\
\vec{q} \text{'s label is then } y_{i^*}
$$

Really, the intuition is very simple. We expect similar data points to be located near each other, and therefore they are likely to have the same class or value. This method works for both classification problems and regression problems. We can also improve performance by considering the \(k\) nearest neighbors (k-NN), the intuition being that if we get multiple neighbors, we won't get thrown off by one odd or mislabeled data point. But how is this related to estimating the probability of something?
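
Before getting to that, here is a minimal brute-force sketch of the k-NN classification rule just described (illustrative code, not JSAT's NearestNeighbour class, which uses a VectorCollection instead of the full scan):

import java.util.Arrays;
import java.util.List;

// A brute-force k-NN classifier sketch: sort all points by distance to q, vote among the k closest.
public class KnnSketch {
    static double dist(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.sqrt(s);
    }

    static int classify(double[] q, List<double[]> xs, List<Integer> ys, int k, int numClasses) {
        Integer[] order = new Integer[xs.size()];
        for (int i = 0; i < order.length; i++) order[i] = i;
        Arrays.sort(order, (a, b) -> Double.compare(dist(q, xs.get(a)), dist(q, xs.get(b))));

        int[] votes = new int[numClasses];             // majority vote of the k closest points
        for (int i = 0; i < k; i++) votes[ys.get(order[i])]++;
        int best = 0;
        for (int c = 1; c < numClasses; c++)
            if (votes[c] > votes[best]) best = c;
        return best;
    }
}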

We can compute the probability density of a data set \(D\) as
$$ P(\vec{x}) \approx \frac{k}{n \cdot V} $$
where
\(k\) is the number of data points we are considering
\(n\) is the number of data points in \(D\)
\(V\) is the volume of the area.

For the k-NN algorithm, we are using this implicitly. We fix the number of neighbors \(k\) ahead of time, the data set does not change size, and then the volume \(V\) changes depending on how far away the \(k\) nearest neighbors are. If all the neighbors are very close, \(V\) will be small, and make the final probability larger. If they are far away, then \(V\) will be large, and make the probability smaller.

Why don't we use this form of the algorithm? Well, it turns out it's not a very accurate measure to use explicitly. In another post, we will look at fixing the volume and letting \(k\) change as needed. Yet still, k-NN works well in practice, and has some strong theoretical results. You can use the k-NN algorithm in JSAT with the NearestNeighbour class.

So are there any down sides to k-NN? Yes. Unlike most algorithms, k-NN shifts all of its training work to extra work that needs to be done during classification. The naive implementation has the issue that we must examine every data point to make a decision, meaning it will take \(O(n)\) work to compute a single result! To tackle this issue, JSAT separates out the k-NN algorithm from the process of finding the nearest neighbor of a query into the VectorCollection interface. A VectorCollection knows how to search a data set to find either the \(k\) nearest neighbors, or all neighbors within a certain radius.

Despite faster methods for getting the nearest neighbors, k-NN can still be very slow, and can't always be used for large data sets.


Sunday, July 1, 2012

What is a Probability Density Function, and what is it good for?

A lot of Machine Learning is about estimating the probability of classes. But first, how do we describe the probability of something? I'll talk about this in fairly general terms, skipping over a lot of rigorous math. This is mostly because I want to introduce the concept, and introduce some high level ideas.

So, how do we represent the probability of \( x\) occurring?  We denote it \(P(x)\), where the 'P' is for probability. The probability could be for something discrete ( where \(x\) can have only one of a finite number of values), or continuous. Flipping a coin has only two possibilities - heads or tails [Or perhaps it has three: heads, tails, and landing on its side]. One would denote the probability of a coin flip being 'Heads' as \(P(x = Heads)\). 


While discrete distributions are useful, I find continuous probability distributions more interesting and useful for ML. As the name implies, continuous distributions represent the probability of \(x\) having any value in a given range. Often, this range is \((-\infty , \infty)\). So say we are trying to estimate the probability of a bright eyed young college graduate getting a $60k job. We would write \(P(x = 60,000)\). However, this does not make as much sense as we might think. We will get to why in a moment. 


I like to phrase a probability distribution by its probability density function, which we denote \(f(x)\). So \(P(x) = f(x)\). Seems simple, but wouldn't we expect \(P(x = 60,000) \approx P(x = 60,001) \)? So say there is a 1% chance of getting a 60k job. By our intuition, $60,001 would also be around 1%, and so on. But say we incremented by even smaller steps, say one cent at a time. Do you see the problem? The sum of probabilities for just a range of a few dollars would be over 100%! 


So really, you can't use \(P(x)\) when \(x \in \Re \), but you can use it to compare which is more likely, \(x\) or \(x + \epsilon \), even if it's only going to be a small difference. Which is useful. 


But back to the probability density function, what good is it for? Well, it turns out we can reasonably ask for the probability that we get a job offer in the range of 55k to 65k, or \(P(55k \leq x \leq 65k)\). 


Because we have the function for the probability, and by definition its integral must be one, we can write
$$P(a \leq x \leq b) = \int_{a}^{b} f(x)\,dx $$
Now this is meaningful! We can ask for the probability of getting a job offer in a certain range. This is also useful to provide meaningful information.

For example, say it takes \(Z\) dollars a year to live a life where you can afford food, shelter, and medical care. We can then use our density function and find
$$P(-\infty \leq x \leq Z)$$
This means we want the probability of a job offer \(x\) being less than the amount of money an individual would need to afford basic needs. This is important information, and the probability of this happening is something we would want to minimize. Now that we can ask for it, we can target it and monitor changes in the value.

We can also alter this and ask: what is the mean job offer? We essentially need to average over all possible values, which can be written as
$$ Mean = \int_{-\infty}^{\infty} x \cdot f(x)\,dx $$
We may also want to know the median, which is more resilient to outliers. In this case, we can solve
$$ \int_{-\infty}^{Median} f(x)\,dx = \frac{1}{2} $$
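
As a quick, made-up example (the standard exponential distribution with \(f(x) = e^{-x}\) for \(x \geq 0\), not real salary data), these quantities work out to

$$P(a \leq x \leq b) = \int_{a}^{b} e^{-x}\,dx = e^{-a} - e^{-b}$$
$$Mean = \int_{0}^{\infty} x e^{-x}\,dx = 1 \qquad \int_{0}^{Median} e^{-x}\,dx = \frac{1}{2} \implies Median = \ln 2 \approx 0.69$$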

This can also be useful in ML. Say we want to try to identify someone as a Computer Science graduate or a Philosophy graduate based on the value of their job offer. One way to do this would be to compare their probabilities. If \(P_{CS}(x) > P_{P}(x) \), then we would say it's more likely they were a Computer Science major.

Another option would be to take the means of each distribution, getting \( Mean_{CS} \) and \( Mean_{P} \). If a job offer was closer to the mean of a Philosophy major, then we would say they were probably a Philosophy major. We would check which is smaller, \( |Mean_{CS}-x| \) or \( |Mean_{P}-x| \). The same method could be done using the median instead. 

So cool, now we can classify things using the probability density function \(f(x)\)! But at no point have I said how we find \(f(x)\)! Finding that function, or approximating it, is basically what a lot of Machine Learning is about, and we'll talk about that more later.

How we use the pdf depends on what our goals are. If computing \(P(x)\) is expensive, the first method of just comparing probabilities won't work well if we have to do thousands of classifications every minute. If we used the second method, which only needs 2 subtractions, we know it will be really fast - but it might not be as accurate. It's a trade off, which is another topic that will be discussed.

Friday, June 29, 2012

What is Machine Learning?

So what is Machine Learning? A prerequisite question if you plan on using it, and associated with such buzz words as "big data". I like to describe it as the intersection of Artificial Intelligence and Statistics. 


AI is trying to get computers to make smart decisions. This is different from what many people think, that AI wants to get computers to think like humans. No such goal really exists, because humans do stupid things. No banker wants an AI system buying thousands of lottery tickets as an investment. Statistics is the science of analyzing and interpreting data. 


So, ML is their union - we want to get computers to make smart decisions (or tell us something useful) by providing them data. Not just any data, but data about the thing we want to make a decision about. The data is in the form of a data set, which is made of many data points. But what is in a data point?


A data point can have categorical values. For example, a car can have either blue, red, black, or white paint. A data point can also have numerical values; a car's engine has some number of horsepower, and a certain liter size. All data points in a data set are consistent; they each have the same types of values in the same order. 


Many algorithms work especially well for data sets that have all numerical values. In this case, we often denote each data point \( x \) in a data set as a vector, \(\vec{x} = \{x_0, x_1, \ldots, x_{n-1} \}\), where \(n\) is the number of values that make up a data point. But data points can have any mix of numeric and categorical values, and you can convert between the two (at some cost).


The three ML problems I will be talking about are Classification, Regression, and Clustering. 


Classification is the task of assigning a category to an example. Spam filters are a classification problem. An email can be undesirable (spam), or desirable (called ham, because ham is better than spam!). In these problems we know all of the categories before we begin, and we provide prior examples of the problem where we know the answer. That means we first train our spam filter using lots of examples of spam emails and ham emails. We then use it after it's been trained. 


Regression is similar to classification, but instead we have a numerical value we wish to predict. For example, a car manufacturer might want to use the engine size, liter size, and car weight to predict the miles per gallon a car will get. 


Classification and Regression problems are very related, and you can actually convert algorithms that do classification into regression, and regression into classifiers. But this does not come for free, and has various trade offs. Many algorithms also support both intrinsically, and do not have to be re-written or converted. 


The last problem is Clustering. The first two problems are supervised problems - meaning a human has to provide labeled examples. Clustering is un-supervised - we provide the algorithm with no human-provided labels. The goal is not to predict anything, but to discover something we don't know about the data. For example, a company might use clustering to discover groups of similar customers. They could then produce marketing aimed at a specific subset of their customers. 

Thursday, June 28, 2012

What on earth is this about?

So I'm starting a blog. What's it about? It is about the Java Statistical Analysis Tool, a library I'm writing in my free time for fun and to teach myself. It's constantly evolving, has lots of design mistakes in it, and will eventually change several times. There are also several parts where I should have just used a pre-made library, but did it myself just for learning purposes. It's a bad habit, but oh well.

But I'm hoping it will be helpful for people. Currently, I'm focusing it towards use in Machine Learning (ML), specifically focusing on Classification, Regression, and Clustering.

So what will I put here that you want to read? Hopefully everything, but unlikely. I'll be putting in a number of introductory and high level posts, intended for people who are technical in general but not in ML. I'll also post about algorithms and methods I have or am implementing, often in the context of using them in JSAT.