Tuesday, February 18, 2014

Fail Fast for Model Building?

Fail Fast is a concept that has been around for a while, and one I'm very fond of for software development. A simple way to state the goal for a programmer is: attempt to throw an error sooner rather than later.

This doesn't mean to simply throw errors because you can. What we want is to identify situations where we know that the procedure can't possibly succeed. Instead of waiting for the process to fail on its own, we will fail the process (or throw an error or alert someone) the instant we know that failure is the only option.
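To make the software version concrete, here is a minimal sketch in Python (the function and its checks are just an illustration, not from any particular library): check the preconditions up front and raise immediately, rather than letting a bad argument cause a confusing failure deep inside the computation.

```python
import math


def mean_log(values):
    """Return the mean of the logs of `values`, failing fast on bad input."""
    # Fail fast: we already know the computation cannot succeed, so raise a
    # clear error now instead of blowing up somewhere in the middle of a loop.
    if len(values) == 0:
        raise ValueError("mean_log needs at least one value")
    if any(v <= 0 for v in values):
        raise ValueError("mean_log is undefined for non-positive values")

    return sum(math.log(v) for v in values) / len(values)
```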

But what about model training? Can we detect when a model is in a bad spot, and terminate training early? The thought came to me as I was implementing the AMM algorithm, as the published results seemed incredibly good. Not quite too good to be true, but I hadn't seen much other discussion of the paper, and something felt like it was missing. You can see their reported times here: the AMM algorithm is almost as accurate as the RBF SVM while being only a bit slower than the linear SVM. But those times are just for the final model; how did they select the parameters?

The answer, as it so often is: grid search. When I ran the same algorithm through grid search I quickly discovered the flaw: AMM takes an order of magnitude or two longer to train when given bad parameters for the specific data. So can we really say that AMM is so fast? On new data we will have to run grid search, and getting all of our results is going to take at least as long as the slowest parameter setting.
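To make the cost argument concrete, here is a rough sketch of what model selection actually costs, using scikit-learn's LinearSVC on synthetic data as a stand-in (the grid and dataset sizes are made up): the wall-clock time of grid search is the sum over every setting tried, so one setting that trains slowly drags the whole search down with it.

```python
import time

from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=20000, n_features=100, flip_y=0.05,
                           random_state=0)

# Time each parameter setting in the grid; poorly suited settings typically
# take far more iterations to converge than well suited ones.
times = {}
for C in (1e-3, 1e-1, 1e1, 1e3):
    start = time.time()
    LinearSVC(C=C).fit(X, y)
    times[C] = time.time() - start

total = sum(times.values())
for C, t in sorted(times.items()):
    print("C=%g: %.2fs (%.0f%% of the search)" % (C, t, 100 * t / total))
print("Total model selection time: %.2fs" % total)
```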

This isn't a problem unique to the paper I've singled out. An SVM's runtime is heavily influenced by its regularization parameter C. Budgeted kernel methods exhibit the same behavior, so it's not just an issue of the number of support vectors. L1 regularized linear models, in particular, are enormously slow when using very little regularization. To me there seem to be two possible solutions.

  1. Develop flexible non-linear models that have more strictly bounded computational complexity, independent of the model's regularization.
  2. Develop methods of detecting early on when a model will fail to reach a good solution.
The first is, I think, the preferred option, though I'm not entirely sure what it would look like. In some sense, budgeted kernel methods like the Forgetron already fit this description, but they don't quite fit the bill. We still need to determine the budget size, and even when the budget is small enough to be fast, the computational cost can be much higher when the model makes errors due to bad parameters (for example, consider a projection step that is only performed on errors).

The computer scientist's option is the second one, which is what I'm really talking about here. I want to run grid search and have some of the models fail fast, where they simply stop training and report to the grid search that the parameter set was bad.
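Roughly, I imagine something like the sketch below (every name in it is hypothetical; it's only meant to show the shape of the protocol): the trainer is allowed to raise a "this parameter set is hopeless" signal, and the grid search catches it, records the setting as bad, and moves on instead of paying for a full fit.

```python
class FailedFast(Exception):
    """Raised by a trainer that has decided its parameter setting cannot win."""


def train_model(params, data):
    # Stand-in trainer: a real implementation would run its usual training
    # loop and raise FailedFast as soon as it decides the setting is hopeless.
    if params["C"] > 100:                       # pretend this region is "bad"
        raise FailedFast("no progress, giving up early")
    return {"params": params, "error": 0.05 * params["C"]}


def grid_search(param_grid, data):
    results, failed = [], []
    for params in param_grid:
        try:
            results.append(train_model(params, data))
        except FailedFast:
            failed.append(params)               # record it and keep going
    return min(results, key=lambda r: r["error"]), failed


best, failed = grid_search([{"C": c} for c in (0.1, 1.0, 10.0, 1000.0)], data=None)
print("best:", best, "failed fast:", failed)
```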

I think some easy preliminary gains could be achieved with a little work. Running a faster linear model as the first step gives us a good upper bound on our desired error rate. If the more expensive model doesn't think it will be able to beat that error, why bother continuing? We already have a linear model, which is simpler to deal with, that can reach the same goal. So go ahead and fail, even though we would have eventually reached some solution.
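One way that might look in practice, sketched below with hypothetical `partial_fit` and `validation_error` methods (and reusing the FailedFast signal from the sketch above): fit the cheap linear model first, treat its held-out error as the bar to beat, and abandon the expensive model if it has burned through a decent fraction of its training budget without getting anywhere near that bar.

```python
def fit_or_fail_fast(expensive_model, linear_baseline_error, train, val,
                     max_epochs=100, check_from=0.25, slack=1.05):
    """Train `expensive_model`, giving up if it can't approach the linear baseline.

    `expensive_model` is assumed to expose `partial_fit(train)` (one epoch of
    training) and `validation_error(val)` -- hypothetical methods for this sketch.
    """
    for epoch in range(1, max_epochs + 1):
        expensive_model.partial_fit(train)
        err = expensive_model.validation_error(val)

        # The linear model already achieves `linear_baseline_error`. If we are a
        # quarter of the way through the budget and still can't get within
        # `slack` of it, assume this parameter set is a loser and fail fast.
        if epoch >= check_from * max_epochs and err > slack * linear_baseline_error:
            raise FailedFast("epoch %d: error %.3f vs. baseline %.3f"
                             % (epoch, err, linear_baseline_error))
    return expensive_model
```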

The immediate option is to estimate our error periodically on a held-out validation set, and give up if we don't see significant gains in performance over time. But this won't work for every model. This paper by Cotter et al. shows that SMO-based SVM algorithms don't start reaching a good solution until very close to the final solution, so such a strategy would be limited to models that have useful intermediate solutions.
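For the models where intermediate solutions are informative, the check itself is simple enough to sketch; something like the function below, which looks at the recent history of held-out errors and gives up when the best recent value is no meaningful improvement over what came before (the thresholds are made up for illustration).

```python
def should_give_up(val_errors, patience=5, min_improvement=0.001):
    """Return True when the held-out error has stopped improving meaningfully.

    `val_errors` holds one validation error per periodic check. We give up if
    the best error in the last `patience` checks is not at least
    `min_improvement` better than the best error seen before that window.
    """
    if len(val_errors) <= patience:
        return False                      # not enough history to judge yet
    best_before = min(val_errors[:-patience])
    best_recent = min(val_errors[-patience:])
    return best_recent > best_before - min_improvement


# e.g. should_give_up([0.40, 0.31, 0.27, 0.26, 0.26, 0.26, 0.26, 0.26, 0.26]) -> True
```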

Perhaps someone knows of similar work that is already out there, but I think it's a line of thought worth investigating in the future.