First thing is first, a true distance metric has to have some specific properties.
- It needs to be indesernible, meaning \(d(\vec{x},\vec{y}) = 0 \) if and only if \(\vec{x} = \vec{y}\)
- It needs to be symmetric, meaning \(d(\vec{x},\vec{y}) = d(\vec{y},\vec{x}) \)
- It needs to be sub-additive, meaning \(d(\vec{x},\vec{z}) \leq d(\vec{x},\vec{y}) + d(\vec{y},\vec{z}) \)
Many sources also list that \(d(\vec{x}, \vec{y}) \geq 0 \), but it really dose not need to be stated. Having the first 3 conditions satisfied implies the non-negativity. Both the first and the 2nd properties are also pretty easy to deal with. You just have to make sure everything cancels out to get property one. If your metric isn't symmetric, you can easily make it so by defining
\[\hat{d}(\vec{x},\vec{y}) = d(\vec{x},\vec{y}) + d(\vec{y},\vec{x})\]
It's the third property, often called the triangle inequality, that is really important. Its what allows JSAT to implement VectorCollections that can return the nearest neighbor in \(log(n)\) time for many datasets. But I'll discuss such tactics another day.
What I really want to get to right now, is how a distance metric can be helpful. Probably the most common metric is the L2 norm, also called the Euclidian Distance.
\[d(\vec{x}, \vec{y}) = \|\vec{x}-\vec{y}\| = \sqrt{\sum_{i = 1}^{n} (x_i - y_i)^2 } \]
which is a specific case of the p-Norm
\[\left(\sum_{i = 1}^{n} |x_i - y_i|^p \right)^{\frac{1}{p}} \]
Now, something I did not discues last time, is that the nearest neighbor algorithm is sensitive to the scale of the variables. What does that mean?
Consider a data set where we want to estimates the maximum capacity of a building. Let us say its attributes are the number of floors and the square feet per floor. Two dimensions.
\[\hat{d}(\vec{x},\vec{y}) = d(\vec{x},\vec{y}) + d(\vec{y},\vec{x})\]
It's the third property, often called the triangle inequality, that is really important. Its what allows JSAT to implement VectorCollections that can return the nearest neighbor in \(log(n)\) time for many datasets. But I'll discuss such tactics another day.
What I really want to get to right now, is how a distance metric can be helpful. Probably the most common metric is the L2 norm, also called the Euclidian Distance.
\[d(\vec{x}, \vec{y}) = \|\vec{x}-\vec{y}\| = \sqrt{\sum_{i = 1}^{n} (x_i - y_i)^2 } \]
which is a specific case of the p-Norm
\[\left(\sum_{i = 1}^{n} |x_i - y_i|^p \right)^{\frac{1}{p}} \]
Now, something I did not discues last time, is that the nearest neighbor algorithm is sensitive to the scale of the variables. What does that mean?
Consider a data set where we want to estimates the maximum capacity of a building. Let us say its attributes are the number of floors and the square feet per floor. Two dimensions.
- Data point 1: \(\vec{x} = \{1000 \text{ floors}, 2 \text{ square feet}\}\)
- Data point 2: \(\vec{y} = \{500 \text{ floors}, 4 \text{ square feet}\}\)
Clearly we can see, \(1000 \cdot 2 = 2000\) total square feet for \(\vec{x}\), and \(500 \cdot 4 = 2000\) square feet for \(\vec{y}\). Our intuition would tell us that they would hold a similar number of people, perhaps the need for larger elevators in one building causes the difference. Using the L2 norm, we get
\[d(\vec{x}, \vec{y}) = \sqrt{(1000-500)^2 + (2-4)^2} = 2 \sqrt{62501} \approx 500 \]
Oh no batman! The number of floors is so much greater than the square footage, its dominating our distance calculation! Logically we would expect these two points to be near each other, since they have the same total square footage, but our metric has placed them far apart!
Now obviously, I made up this example to illustrate the problem. One solution is to create our own custom distance metric with expert knowledge. Not only is this time intensive, if our custom metric does not obey property (3) above, we have to use \(O(n)\) searches if we wanted to use nearest neighbor!
The most common solutions I see is that people choose to scale the whole data set. They make sure all values are in the range \([0,1]\) or \([-1,1]\). If you go this route, you also have to remember to scale the new points you want to classify. A minor inconvenience, but nothing horrible.
A less traveld path, that I like to use, is to let my distance metric handle the problem for me. The Mahalanobis metric can do just that in many cases. If \(\Sigma\) is the covariance matrix for our data set, then the Mahalanobis distance is
\[d(\vec{x},\vec{y}) = \sqrt{ (\vec{x}-\vec{y})^T \Sigma^{-1} (\vec{x}-\vec{y}) } \]
So wat just happened here? If we set \(\Sigma\) equal to the identity matrix, we get
\[\sqrt{ (\vec{x}-\vec{y})^T I^{-1} (\vec{x}-\vec{y}) } \]
\[\sqrt{ (\vec{x}-\vec{y})^T (\vec{x}-\vec{y}) } \]
This is taking the diference between each value, squaring them, and summing them together. Wait a minute - we've seen this before! After some arithmetic, we would see that its equal to
\[ \|x-y\| \]
So if there was no variance, we are equivalent to the Euclidean distance! When variance is present, by multiplying with \(\Sigma^{-1}\) we will perform a scaling of the individual attributes based on their co-variance. An additional benefit, is that redundant variables will have less weight, which avoids over counting the contributions of redundant features.
This is a fair amount of extra work to perform during distance computations, but it is very convient. Even if the data has already been scaled, using the Mahalanobis metric can still give additional gains in accuracy. Though its not perfect, and works best when the data can be well approximated by a Gaussian distribution. When the data is really abnormal, it can actually hurt the performance of our algorithms.
But hopefully you see that its worth putting some thought into selecting your distance metric. In many cases, the L2 will do just fine. But when your hunting down the last 3% of accuracy you need, or trying to deal with a poorly behaved data set, consider lifting some of the work off of the algorithm and into the metric.