Wednesday, February 25, 2015

ML Theory, some Common Questions

Prompted by a few recent exchanges, I thought I would take a stab at pre-answering a few common questions I see/get about some results in Machine Learning, and how they relate to applying. You can find these all elsewhere as well.

Q: A Neural Network with a single hidden layer can approximate any function arbitrarily well, so why have deep learning / multiple layers?  Why don't Neural Networks solve everything? 

A: First, I've intentionally phrase this one in the way I hear it - because its not quite true. It can only approximate functions that are continuous arbitrarily well. To get discontinuous functions in the mix, we need a second hidden layer. But that's not terribly important.

What this result doesn't tell us is how many neurons to use, and the one of the benefits of having multiple layers is that it may take less neurons overall to represent something with a hierarchy than to try and represent it with one colossal hidden layer. That is to say, the power of a network to represent things is not linear in the number of neurons and hidden layers. Increasing the number of hidden layers can significantly simplify the problem / complexity of the network.

The other issue is this result is only for predicting the response in the region the network was trained on. It tells us nothing about how the network might perform when given an outlier as input. This subtly leads us to the issue that this statement doesn't tell us how the network will generalize to new data. If we have concept drift, or not enough data to fully encompass the problem, this result does not help us at all - as we won't be able to generalize to new data.

Q: Algorithm X converges to the Bayes optimal error rate (times some \(\epsilon\) ), so I just need enough data and I will get the optimal classifier. 

A: This is similar to the first question, and is a result that is true for doing a simple Nearest Neighbor search, but instead of function approximation the result is explicitly for classification. As one may have guessed, this doesn't tell us anything about how much data is needed to get to the Bayes error rate. Even more sadly: this doesn't tell us how close we are to reaching the Bayes error rate / when we get there - just that eventually we will.

A more subtle issue comes from the use of a classification problem instead of a regression / function approximation problem. When we talk about function approximation, we generally assume that we know all the inputs to the function (thought this isn't necessarily true). When we set up classification problems in practice we generally don't know for sure what the needed features are. This is important because the Bayes optimal error rate given is dependent on the features given ie: different sets of features have different Bayes optimal error rates. It doesn't tell us that we will get to the best possible error rate for the problem, but for the features given the problem.

Q: Logistic Regression and Support Vector Machines all use a loss function, but why don't we use the zero-one loss? Phrased another way: why don't we learn to minimize the classification errors, rather than some other loss function. 

A: The answer to this is simply we don't know how! Its also an interesting question if this would be the best thing to do (ie: will it generalize to new data?). SVMs and Logistic Regression are large-margin classifiers (where the form is the maximum-margin), and its often been found that large-margin models work better than those that don't have a large margin. However a large-margin loss incurs can incur a penalty on data that they classify correctly, where the zero-one loss wouldn't. So that result should give us pause to think about the real problem: what should we optimize to maximize our generalized error? Again, we don't really know!

Q: I don't understand why high-dimensional data is more difficult. What's the intuition to the curse-of-dimensionality being a problem, isn't more better?

A: For explaining the intuition, I like to think of it like this: For every feature/input, we have X many weights to learn. So lets assume that each data point can contribute to learning only one of the weights correctly. If we have just as many data points as we have weights, and the data is perfectly clean, then we can solve the problem (if the problem is full rank). However data is never perfect and functions can be noisy / overlap, so we want multiple data points for each feature. That way we can get a better idea of a weight value that works best for most of the features.

When we have high-dimensional problems, \(D >> N\), that means we can't even get a full rank solution! Thats the basis for an intuition of why high dimensional. The way I've explained this is that we need a linear relationship between our number of features \(D\) and the number of data points \(N\), where \(N > D\). The truth is that we may need exponentially more data points as we increase the dimension of the problem.

A simple analogy: If I gave you a budget of $500 (your data points) and 10 products (your features) to evaluate, you could easily buy all the products, test them, and tell me which are more or less important (the importance being your weights). But if I told you to evaluate 10,000 products, and I upped your budget to $500,000 - you wouldn't really be able to just evaluate all the products any more. You could hire others to help you, but then you wouldn't have enough money to buy all the products. And as you hire more people someone has to manage them, which is even more of your budget - your overheads are increasing! Obviously don't take this too literally, but hopefully it connects how just increasing one thing (the features) can have a bigger impact on what you need (data points) to solve the problem.

Q: I don't understand why we have different algorithms for classification/regression, aren't they all doing the same thing? Why isn't there a/ what is the 'best' algorithm?

A: There actually exists a proof that there can be no one best algorithm. The fact of the matter is all of these algorithms are models trying to approximate the world, and each model will make various different assumptions about the world. So we try to chose the model which best matches our understand of the world, or at least the one that performs best if we don't know how to world really works!

Often there exists models that make very few assumptions about the world - like nearest neighbor / decision trees. And as a consequence of that, we often see them perform very well on a variated of problems. But even these have some implicit assumptions.

A normal decision implicitly tree assumes that separations occur on axises. This means we will require a tree of infinite depth to "truly" model anything that doesn't fit on an axis. But we can often approximate them well enough with this, and so it works. But if we know that there is a lot of curvature / bending in the data, then it may be worthwhile using a method that assumes that behavior is part of the world, as it will probably have an easier time learning. This other model would then be better for the problem.

That doesn't mean decision trees would do badly, or that there aren't any number of models that would get equally good results. It just means we have to pick and choose what models/tools we use for the various problems we encounter.