Monday, July 16, 2012

Distance Metrics

In my last post, I mentioned that we needed some distance metric in order to use the nearest neighbor algorithm. While you can define a distance metric for anything, I'll be restricting myself to just the JSAT definition of the distance between two vectors \(\vec{x}\) and \(\vec{y}\), which we denote as \(d(\vec{x},\vec{y})\).

First things first: a true distance metric has to have some specific properties. 
  1. It needs to satisfy the identity of indiscernibles, meaning \(d(\vec{x},\vec{y}) = 0 \)  if and only if \(\vec{x} = \vec{y}\)
  2. It needs to be symmetric, meaning \(d(\vec{x},\vec{y}) = d(\vec{y},\vec{x}) \) 
  3. It needs to be sub-additive, meaning \(d(\vec{x},\vec{z}) \leq d(\vec{x},\vec{y}) + d(\vec{y},\vec{z}) \) 
Many sources also list non-negativity, \(d(\vec{x}, \vec{y}) \geq 0 \), but it really does not need to be stated: satisfying the first three conditions implies it. The first two properties are also pretty easy to deal with. You just have to make sure everything cancels out to get property one, and if your metric isn't symmetric, you can easily make it so by defining 
\[\hat{d}(\vec{x},\vec{y}) = d(\vec{x},\vec{y}) + d(\vec{y},\vec{x})\]
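To make that trick concrete, here is a minimal Java sketch (plain Java rather than JSAT's own distance metric classes) that wraps an arbitrary, possibly asymmetric dissimilarity measure and returns the symmetrized version defined above. The one-sided measure used in main is made up just for illustration.

import java.util.function.ToDoubleBiFunction;

public class SymmetrizedDistance {
    /**
     * Wraps a possibly asymmetric dissimilarity d and returns
     * d_hat(x, y) = d(x, y) + d(y, x), which is symmetric by construction.
     */
    static ToDoubleBiFunction<double[], double[]> symmetrize(
            ToDoubleBiFunction<double[], double[]> d) {
        return (x, y) -> d.applyAsDouble(x, y) + d.applyAsDouble(y, x);
    }

    public static void main(String[] args) {
        // A deliberately asymmetric measure: only counts coordinates where x exceeds y
        ToDoubleBiFunction<double[], double[]> oneSided = (x, y) -> {
            double sum = 0;
            for (int i = 0; i < x.length; i++)
                sum += Math.max(x[i] - y[i], 0);
            return sum;
        };
        ToDoubleBiFunction<double[], double[]> dHat = symmetrize(oneSided);

        double[] a = {3, 1}, b = {1, 2};
        System.out.println(oneSided.applyAsDouble(a, b)); // 2.0
        System.out.println(oneSided.applyAsDouble(b, a)); // 1.0
        System.out.println(dHat.applyAsDouble(a, b));     // 3.0, the same either way
        System.out.println(dHat.applyAsDouble(b, a));     // 3.0
    }
}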
It's the third property, often called the triangle inequality, that is really important. It's what allows JSAT to implement VectorCollections that can return the nearest neighbor in \(\log(n)\) time for many datasets. But I'll discuss such tactics another day. 


What I really want to get to right now is how a distance metric can be helpful. Probably the most common metric is the L2 norm, also called the Euclidean distance.
\[d(\vec{x}, \vec{y}) = \|\vec{x}-\vec{y}\| = \sqrt{\sum_{i = 1}^{n} (x_i - y_i)^2 } \]
which is a specific case of the p-Norm 
\[\left(\sum_{i = 1}^{n} |x_i - y_i|^p \right)^{\frac{1}{p}} \]
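As a quick sketch, here is the p-norm in plain Java (shown this way rather than through JSAT's distance metric classes, just to keep the math visible); with \(p = 2\) it reproduces the Euclidean distance used in the building example below.

public class PNormDistance {
    /** The p-norm distance between x and y; p = 2 gives the Euclidean (L2) distance. */
    static double pNorm(double[] x, double[] y, double p) {
        double sum = 0;
        for (int i = 0; i < x.length; i++)
            sum += Math.pow(Math.abs(x[i] - y[i]), p);
        return Math.pow(sum, 1.0 / p);
    }

    public static void main(String[] args) {
        double[] x = {1000, 2};
        double[] y = {500, 4};
        System.out.println(pNorm(x, y, 2)); // ~500.004, the example worked out below
        System.out.println(pNorm(x, y, 1)); // 502.0, the L1 (Manhattan) distance
    }
}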

Now, something I did not discuss last time is that the nearest neighbor algorithm is sensitive to the scale of the variables. What does that mean?

Consider a data set where we want to estimate the maximum capacity of a building. Let us say its attributes are the number of floors and the square feet per floor. Two dimensions.

  • Data point 1: \(\vec{x} =  \{1000 \text{ floors}, 2 \text{ square feet}\}\)
  • Data point 2: \(\vec{y} =  \{500 \text{ floors}, 4 \text{ square feet}\}\)
Clearly, \(1000 \cdot 2 = 2000\) total square feet for \(\vec{x}\), and \(500 \cdot 4 = 2000\) square feet for \(\vec{y}\). Our intuition tells us that they would hold a similar number of people; perhaps the need for larger elevators in one building accounts for the difference. Using the L2 norm, we get 
\[d(\vec{x}, \vec{y}) = \sqrt{(1000-500)^2 + (2-4)^2} = 2 \sqrt{62501} \approx 500 \]

Oh no batman! The number of floors is so much greater than the square footage that it's dominating our distance calculation! Logically we would expect these two points to be near each other, since they have the same total square footage, but our metric has placed them far apart! 

Now obviously, I made up this example to illustrate the problem. One solution is to create our own custom distance metric with expert knowledge. Not only is this time intensive, but if our custom metric does not obey property (3) above, we are stuck with \(O(n)\) searches if we want to use nearest neighbor! 

The most common solution I see is to scale the whole data set, making sure all values are in the range \([0,1]\) or \([-1,1]\). If you go this route, you also have to remember to scale the new points you want to classify. A minor inconvenience, but nothing horrible. 
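Here is a minimal sketch of that approach in Java: learn the per-attribute minimum and maximum from the training data, then map everything (including new query points) into \([0,1]\). The class and method names are just for illustration.

public class MinMaxScaler {
    private final double[] min, max;

    /** Learns per-attribute minimums and maximums from the training data. */
    MinMaxScaler(double[][] data) {
        int d = data[0].length;
        min = new double[d];
        max = new double[d];
        java.util.Arrays.fill(min, Double.POSITIVE_INFINITY);
        java.util.Arrays.fill(max, Double.NEGATIVE_INFINITY);
        for (double[] row : data)
            for (int j = 0; j < d; j++) {
                min[j] = Math.min(min[j], row[j]);
                max[j] = Math.max(max[j], row[j]);
            }
    }

    /** Maps each attribute into [0, 1]; must also be applied to new query points. */
    double[] scale(double[] x) {
        double[] out = new double[x.length];
        for (int j = 0; j < x.length; j++)
            out[j] = (x[j] - min[j]) / (max[j] - min[j]);
        return out;
    }

    public static void main(String[] args) {
        double[][] data = {{1000, 2}, {500, 4}, {750, 3}};
        MinMaxScaler s = new MinMaxScaler(data);
        System.out.println(java.util.Arrays.toString(s.scale(new double[]{1000, 2}))); // [1.0, 0.0]
        System.out.println(java.util.Arrays.toString(s.scale(new double[]{500, 4})));  // [0.0, 1.0]
    }
}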

A less traveled path, one that I like to use, is to let my distance metric handle the problem for me. The Mahalanobis metric can do just that in many cases. If \(\Sigma\) is the covariance matrix for our data set, then the Mahalanobis distance is 
\[d(\vec{x},\vec{y}) = \sqrt{ (\vec{x}-\vec{y})^T \Sigma^{-1} (\vec{x}-\vec{y}) } \]
So what just happened here? If we set \(\Sigma\) equal to the identity matrix, we get 
\[\sqrt{ (\vec{x}-\vec{y})^T  I^{-1}  (\vec{x}-\vec{y}) } \]
\[\sqrt{ (\vec{x}-\vec{y})^T (\vec{x}-\vec{y}) } \]
This is taking the difference between each pair of values, squaring them, and summing them together. Wait a minute - we've seen this before! After some arithmetic, we would see that it's equal to 
\[ \|\vec{x}-\vec{y}\| \]
So when \(\Sigma\) is the identity matrix (unit variances, no correlation between attributes), we get exactly the Euclidean distance! When the variances differ, multiplying by \(\Sigma^{-1}\) scales the individual attributes based on their variances and covariances. An additional benefit is that redundant variables get less weight, which avoids over-counting the contributions of redundant features. 
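To show the scaling effect without dragging in a full matrix inverse, here is a simplified Java sketch that assumes a diagonal covariance matrix, i.e., it only corrects for the per-attribute variances and ignores the cross-covariances that the full Mahalanobis metric also accounts for. The variance values in main are made up for the building example.

public class DiagonalMahalanobis {
    /**
     * A simplified Mahalanobis distance assuming a diagonal covariance matrix.
     * A full implementation would invert the complete covariance matrix.
     */
    static double distance(double[] x, double[] y, double[] variance) {
        double sum = 0;
        for (int i = 0; i < x.length; i++) {
            double diff = x[i] - y[i];
            sum += diff * diff / variance[i]; // (x-y)^T Sigma^{-1} (x-y) for diagonal Sigma
        }
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        double[] x = {1000, 2}, y = {500, 4};
        // With unit variances this reduces to the Euclidean distance
        System.out.println(distance(x, y, new double[]{1, 1}));         // ~500.004
        // With realistic per-attribute variances, the floor count no longer dominates
        System.out.println(distance(x, y, new double[]{250 * 250, 1})); // ~2.83
    }
}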

This is a fair amount of extra work to perform during distance computations, but it is very convenient. Even if the data has already been scaled, using the Mahalanobis metric can still give additional gains in accuracy. It's not perfect, though, and works best when the data can be well approximated by a Gaussian distribution. When the data is really abnormal, it can actually hurt the performance of our algorithms. 

But hopefully you see that it's worth putting some thought into selecting your distance metric. In many cases, the L2 will do just fine. But when you're hunting down the last 3% of accuracy you need, or trying to deal with a poorly behaved data set, consider lifting some of the work off of the algorithm and into the metric. 

Saturday, July 7, 2012

Probability Estimation: Nearest Neighbor

In my last post, I talked about how estimating the probability of some event \(x \) is useful to us, but I neglected to mention how to do it. This task is what many ML algorithms are trying to do. The Nearest Neighbor algorithm is a very simple method that does this, and it is unreasonably accurate for many applications. 


In fact, the algorithm is easier to understand than the theory behind it, so we will start there. 


In words: we have a training data set \(D\), composed of \(n\) data points \(\vec{x}_i\), each paired with a target value \(y_i\). We classify a new data point \(\vec{q}\) by finding the data point in \(D\) that is closest to \(\vec{q}\), and giving \(\vec{q}\) the same class. 


To be a bit more precise, we need a concept of distance. We denote the distance between two points \(a\) and \(b\) by \(d(a, b)\). Now we can describe the algorithm more explicitly. 

$$
i^* = \arg\min_{i = 1, \ldots, n} d(\vec{x}_i, \vec{q})
$$

\(\vec{q}\)'s label is then \(y_{i^*}\), the label of its nearest neighbor.

Really, the intuition is very simple. We expect similar data points to be located near each other, and therefore they are likely to have the same class or value. This method works for both classification and regression problems. We can also improve performance by considering the \(k\) nearest neighbors (k-NN), the intuition being that if we look at multiple neighbors, we won't get thrown off by one odd or mislabeled data point. But how is this related to estimating the probability of something?
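Here is a naive k-NN classifier in plain Java, a sketch of the idea rather than JSAT's implementation. It does the full \(O(n)\) scan over the training data that I complain about later in this post.

import java.util.Arrays;
import java.util.Comparator;

public class NaiveKnn {
    /**
     * Classifies q by majority vote among its k nearest training points,
     * using the Euclidean distance. This is the naive linear search; JSAT's
     * VectorCollections exist to speed this step up.
     */
    static int classify(double[][] X, int[] y, double[] q, int k) {
        Integer[] idx = new Integer[X.length];
        for (int i = 0; i < X.length; i++) idx[i] = i;
        // Sort the training indices by distance to the query
        Arrays.sort(idx, Comparator.comparingDouble(i -> euclidean(X[i], q)));
        // Majority vote among the k closest
        int[] votes = new int[Arrays.stream(y).max().getAsInt() + 1];
        for (int i = 0; i < k; i++) votes[y[idx[i]]]++;
        int best = 0;
        for (int c = 1; c < votes.length; c++)
            if (votes[c] > votes[best]) best = c;
        return best;
    }

    static double euclidean(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.sqrt(s);
    }

    public static void main(String[] args) {
        double[][] X = {{0, 0}, {0, 1}, {1, 0}, {5, 5}, {5, 6}, {6, 5}};
        int[] y = {0, 0, 0, 1, 1, 1};
        System.out.println(classify(X, y, new double[]{0.5, 0.5}, 3)); // 0
        System.out.println(classify(X, y, new double[]{5.5, 5.5}, 3)); // 1
    }
}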

We can compute the probability density of a data set \(D\) as
$$ P(\vec{x}) \approx \frac{k}{n \cdot V} $$
where
\(k\) is the number of data points we are considering
\(n\) is the number of data points in \(D\)
\(V\) is the volume of the region containing those \(k\) points.

For the k-NN algorithm, we are using this implicitly. We fix the number of neighbors \(k\) ahead of time, the data set does not change size, and then the volume \(V\) changes depending on how far away the \(k\) nearest neighbors are. If all the neighbors are very close, \(V\) will be small, and make the final probability larger. If they are far away, then \(V\) will be large, and make the probability smaller.
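To make the \(k / (n \cdot V)\) estimate concrete, here is a small Java sketch for one-dimensional data, where the "ball" around the query is just an interval and its volume is \(2r\), with \(r\) the distance to the \(k\)-th nearest neighbor. The data in main is made up.

import java.util.Arrays;

public class KnnDensity {
    /**
     * Estimates the density at q as k / (n * V), where V is the volume of the
     * smallest ball around q containing the k nearest training points.
     * Written for 1-D data, where the "ball" of radius r has volume 2r.
     */
    static double density(double[] data, double q, int k) {
        double[] dists = new double[data.length];
        for (int i = 0; i < data.length; i++)
            dists[i] = Math.abs(data[i] - q);
        Arrays.sort(dists);
        double radius = dists[k - 1]; // distance to the k-th nearest neighbor
        double volume = 2 * radius;   // 1-D "ball" volume
        return k / (data.length * volume);
    }

    public static void main(String[] args) {
        double[] data = {1.0, 1.1, 1.2, 1.3, 5.0, 9.0};
        // Density is higher where the points are packed tightly...
        System.out.println(density(data, 1.15, 3));
        // ...and lower out in the sparse region
        System.out.println(density(data, 7.0, 3));
    }
}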

Why don't we use this form of the algorithm? Well, it turns out it's not a very accurate measure to use explicitly. In another post, we will look at fixing the volume and letting \(k\) change as needed. Yet still, k-NN works well in practice, and has some strong theoretical results. You can use the k-NN algorithm in JSAT with the NearestNeighbour class.

So are there any downsides to k-NN? Yes. Unlike most algorithms, k-NN shifts all of its training work to extra work that needs to be done during classification. The naive implementation has the issue that we must examine every data point to make a decision, meaning it will take \(O(n)\) work to compute a single result! To tackle this issue, JSAT separates out the k-NN algorithm from the process of finding the nearest neighbors of a query, which is handled by the VectorCollection interface. A VectorCollection knows how to search a data set to find either the \(k\) nearest neighbors, or all neighbors within a certain radius.

Despite faster methods for getting the nearest neighbors, k-NN can still be very slow, and can't always be used for large data sets.


Sunday, July 1, 2012

What is a Probability Density Function, and what is it good for?

A lot of Machine Learning is about estimating the probability of classes. But first, how do we describe the probability of something? I'll talk about this in fairly general terms, skipping over a lot of rigorous math, because I mostly want to introduce the idea and some high level concepts.

So, how do we represent the probability of \(x\) occurring? We denote it \(P(x)\), where the 'P' is for probability. The probability could be for something discrete (where \(x\) can have only one of a finite number of values), or continuous. Flipping a coin has only two possibilities - heads or tails [or perhaps it has three: heads, tails, and landing on its side]. One would denote the probability of a coin flip being 'Heads' as \(P(x = Heads)\). 


While discrete distributions are useful, I find continuous probability distributions more interesting and useful for ML. As the name implies, continuous distributions represent the probability of \(x\) taking any value in a given range, and often that range is \((-\infty , \infty)\). So say we are trying to estimate the probability of a bright-eyed young college graduate getting a $60k job. We would write \(P(x = 60,000)\). However, this does not make as much sense as we might think. We will get to why in a moment. 


I like to phrase a probability distribution by its probability density function, which we denote \(f(x)\). So \(P(x) = f(x)\). Seems simple, but wouldn't we expect \(P(x = 60,000) \approx P(x = 60,001) \)? Say there is a 1% chance of getting a $60,000 job. By our intuition, $60,001 would also be around 1%, and so on. But say we incremented by something even smaller, by one cent. Do you see the problem? The sum of probabilities for just a range of a few dollars would be over 100%! 


So really, you can't use \(P(x)\) when \(x \in \Re \), but you can use the density to compare which is more likely, \(x\) or \(x + \epsilon \), even if it's only going to be a small difference. Which is useful. 


But back to the probability density function, what good is it for? Well, it turns out we can reasonably ask for the probability that we get a job offer in the range of $55k to $65k, or \(P(55k \leq x \leq 65k)\). 


Because we have the function for the probability density, and by definition its integral over all values must be one, we can write
$$P(a \leq x \leq b) = \int_{a}^{b} f(x) \, dx $$
Now this is meaningful! We can ask for the probability of getting a job offer in a certain range, and the answer carries real information.
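As a sketch, here is how you could approximate \(P(a \leq x \leq b)\) numerically in Java, pretending (purely for illustration) that job offers follow a normal distribution with a mean of $60k and a standard deviation of $10k.

public class ProbabilityInRange {
    /** The pdf of a normal distribution with the given mean and standard deviation. */
    static double gaussianPdf(double x, double mean, double sd) {
        double z = (x - mean) / sd;
        return Math.exp(-0.5 * z * z) / (sd * Math.sqrt(2 * Math.PI));
    }

    /** Approximates P(a <= x <= b), the integral of f from a to b, with the trapezoid rule. */
    static double probability(double a, double b, double mean, double sd, int steps) {
        double h = (b - a) / steps, sum = 0;
        for (int i = 0; i < steps; i++) {
            double x0 = a + i * h, x1 = x0 + h;
            sum += 0.5 * (gaussianPdf(x0, mean, sd) + gaussianPdf(x1, mean, sd)) * h;
        }
        return sum;
    }

    public static void main(String[] args) {
        // Pretend job offers follow a normal distribution with mean $60k and sd $10k
        double mean = 60_000, sd = 10_000;
        // P(55k <= x <= 65k) is about 0.383 for these made-up parameters
        System.out.println(probability(55_000, 65_000, mean, sd, 10_000));
    }
}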

For example, say it takes \(Z\) dollars a year to live a life where you can afford food, shelter, and medical care. We can then use our density function and find
$$P(-\infty \leq x \leq Z)$$
This means we want the probability of a job offer \(x\) being less than the amount of money an individual would need to afford basic needs. This is important information, and the probability of this happening is something we would want to minimize. Now that we can ask for it, we can target it and monitor changes in the value.

We can also alter this and ask: what is the mean job offer? We essentially need to average over all possible values, which can be written as
$$ Mean = \int_{-\infty}^{\infty} x \cdot f(x) \, dx $$
We may also want to know the median, which is more resilient to outliers. In this case, we can solve
$$ \int_{-\infty}^{Median} f(x) \, dx = \frac{1}{2} $$
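As a quick worked example, if the offers followed a normal distribution with mean \(\mu\) and standard deviation \(\sigma\), both integrals have closed forms:
$$ \int_{-\infty}^{\infty} x \cdot \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(x-\mu)^2}{2\sigma^2}} \, dx = \mu, \qquad \int_{-\infty}^{\mu} \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(x-\mu)^2}{2\sigma^2}} \, dx = \frac{1}{2} $$
so the mean and the median agree. For a skewed distribution, as real salary data tends to be, they will not, which is exactly when the median's resilience to outliers matters.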

This can also be useful in ML. Say we want to try to identify someone as a Computer Science graduate or a Philosophy graduate based on the value of their job offer. One way to do this would be to compare their probabilities: if \(P_{CS}(x) > P_{P}(x) \), then we would say it's more likely they were a Computer Science major.

Another option would be to take the means of each distribution, getting \( Mean_{CS} \) and \( Mean_{P} \). If a job offer was closer to the mean for Philosophy majors, then we would say they were probably a Philosophy major. We would check which is smaller, \( |Mean_{CS}-x| \) or \( |Mean_{P}-x| \). The same method could be done using the median instead. 
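That second rule is only a couple of lines of Java; the class means below are hypothetical numbers, just for illustration.

public class NearestMeanClassifier {
    /**
     * The second decision rule: assign x to whichever class mean it is closest to.
     */
    static String classify(double x, double meanCS, double meanP) {
        return Math.abs(meanCS - x) < Math.abs(meanP - x) ? "Computer Science" : "Philosophy";
    }

    public static void main(String[] args) {
        double meanCS = 70_000, meanP = 45_000; // hypothetical class means
        System.out.println(classify(62_000, meanCS, meanP)); // Computer Science
        System.out.println(classify(50_000, meanCS, meanP)); // Philosophy
    }
}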

So cool, now we can classify things using the probability density function \(f(x)\)! But at no point have I said how we find \(f(x)\)! Finding that function, or approximating it, is basically what a lot of Machine Learning is about, and we'll talk about that more later.

How we use the pdf depends on what our goals are. If computing \(P(x)\) is expensive, the first method of just comparing probabilities won't work well if we have to do thousands of classifications every minute. If we use the second method, which only needs two subtractions, we know it will be really fast - but it might not be as accurate. It's a trade-off, which is another topic that will be discussed.