Thursday, October 16, 2014

Stacking

I've recently added Stacking as an Ensemble to JSAT. Unlike Boosting or Random Forests, Stacking is very general - it can be applied to any models (the weak learner in boosting must support weighted data, and RF are just special trees). It's been on my TODO list for a long while now, and so far I've gotten pretty good results with it.

The phenomena of simply combining the results of multiple models to improve performance is well known, and many competition submissions are just the averaged (or majority vote) predictions from several created models. The difference with stacking is that it learns a weighted vote based on the performance of each classifier, which should perform better than any individual model in the ensemble (and better than a naive averaging).

However, there is no reason why the combing phase has to be a linear model. So in my implementation the base aggregating model can be specified. Though I have been using linear models for my toying around.

The other thing I've not seen mentioned is that Stacking can be done in an online fashion. By updating the aggregating model before the models being ensembles, you can get unbiassed updates to the aggregator. I've been combing this with Passive Aggressive models as the aggregators and been getting some great results. I use the PA models because they adapt and learn very quickly, and don't need any parameter tuning to work well.

To visually see that it was working, I created a simple ensemble of 3 models on a 2D dataset. My base models are online logistic kernel machines, with 3 different but poorly chosen RBF kernels. One is far too large, one is just a bit too large, and the final one was too small that it overfit.



Then I applied my online Stacking implementation. As you can see, it clearly learns a model that is better than any individual model from the ensemble. This can be very useful in scenarios that don't require real time predictions.

Online Stacking of 3 Logistic Kernel Machines

In doing this I explicitly used a Logistic model because it produces output probabilities. This makes a huge difference in the performance of Stacking. Using the hinge loss just doesn't work as well, as show below.

Online Stacking of 3 Online SVMs

This is because with the hinge, you only get a "yes/no" predication and we need to learn the weight for each model based on that alone. If a model knew that it dosn't know what to predict and returned probabilities indicating such, we could learn a better weight that takes into account the confidence in predictions. In the case of the model 'not knowing' its vote will get washed out when making a prediction by contributing equally to all classes - and we can learn to exploit this. Here we show the probabilities of the 3 models, and it becomes much easier to see how they could be combined in a way that improves the overall accuracy, giving the above result.


The wide model gives us the votes for the top and left/right sides of the space, the middle model gives us most of the edges around the data, and the left model gives us the votes near the borders of the classes (lightened by the smoother border of the middle model).

One could also imagine learning rules to exploit various correlations between model's predictions. This is why I've made the aggregating model configurable rather than force everyone to use linear aggregators.

Sunday, October 5, 2014

Beginner Advice on Learning to Implement ML Algorithms

Jason Brownlee contacted me recently to ask if I could give my advice/opinion on a few questions (prompted by a post of mine on reddit). I'll be answering them here. He also asked about some optimization tricks which I'll answer in a later post.

How can you implement algorithms from scratch to learn ML?


The short, if unsatisfying, answer is practice. When you read a new algorithm or paper, there will be a lot of assumed knowledge. The authors have only a handful of pages to convey a new concept to the reader, so its up to you to read all their references and learn the field well enough that you can read in-between the lines, learn the tricks, and translate higher level papers to lower level code. Some of this will be a bit circular in logic, but the truth is practice is an iterative process. You have to go through everything in cycles and ‘bootstrap’ yourself up from confusion to understanding. The beginning may not be the hardest part, depending on what kind of methods you are interested in, but it will be the most discouraging period. That said, here are the “steps” I generally follow when implementing (or trying) a new algorithm.

1. Read the whole paper, no skimming. Then wait. There is little reason to believe you will grasp everything at first, and re-reading over and over immediately is just going to fatigue you and desensitize you to the words. You need time to let the paper sink in, think about how it relates to other algorithms and ideas, try to build mental model of everything that is happening. Then you go back and re-read the paper, enhance your understanding, and repeat until you reach diminishing returns.

Obviously different algorithms require different amounts of mental effort – a few of my ‘TODO’ papers have been on my list for a few years. But not matter how I feel, or how much I want to just ‘dive in’ and start writing some code – in the end I've never regretted doing this process. Overtime, it also builds up a much deeper understanding of how different algorithms are related and that’s when you can start coming up with your own ideas.

2. Come up with (or find) the simplest problem that is the most complicated that the algorithm can solve perfectly. This may be a bit confusing to read, but essentially you want a toy problem that will force the algorithm to exhibit desirable behavior but will also allow you to get a consistently perfect result.

Doing this often requires some level of understanding of the algorithm in the first place, so in some cases can be a bit tricky. Besides helping to make you think about the algorithm and what it is actually solving, it is also a huge boon when developing and testing your implementation. Especially if you can visualize the problem and solution. For this reason I often create 2D problems for initial development.

First this provides a useful unit test of macro functionality. If you ever go back to improve / refactor / modify your code, it may catch you breaking something on accident.

Second, it can help you catch “near working” implementations. I used to always start with replicating results on common benchmark datasets, such as MNIST. But it is possible to implement an algorithm but fail to account for certain corner cases or code that doesn't quite solve the right problem, but happens to return good accuracies. This is a particularly notorious problem in Neural Networks, and even happens at the algorithm level in top academic journals.

3. Speed comes last. There are lots of performance tricks to get code and ML type algorithms moving faster, but they often clutter the code. When initially developing, it’s more important to get the logic right. This is part of the observation that debugging code is much harder than writing it. If you attempt to be fast before you have confirmed the logic is working, you will have to first determine if it is a problem with the algorithm itself, or the optimization you've done. It’s an extra step that simply isn't needed. Its also much easier to validate optimizations as both correct and a improvement when you can compare them to a base implementation.

4. Know what you are built upon. Given a Linear Algebra library (BLAS/LAPACK), a number of algorithms become very easy to implement efficiently. You should almost always use these instead of implementing them from scratch like I do. My purpose for going from scratch is self understanding and education. In reality this provides almost not practical benefits for my library or code. However, if you do implement an algorithm ontop of these tools - keep in mind that you are using a tool that likely has over 30 decades of built up knowledge, algorithm development, testing, and performance chasing. Try implementing the methods you call from scratch when you have a chance and see how it impacts performance, just so you can get an appreciation of how much work is done for you.

What are some traps that you see?

Most all coding advice applies here (especially any warning about floating point), but I will mention a few ML centric mistakes I often see (in general and for beginners). 

1. Don't assume the paper is perfect or even correct at all. I've implemented a number of papers that had significant errors in the paper that needed to be fixed before the algorithm worked. Peer review is far from perfect, and serious errors will get by. 

I've attempted a few papers I was never able to reproduce the results of. Some were simply missing too many details, but seemed like they could work. Sometimes there are minor mistakes in the paper, or an equation that never got updated/corrected in the final version. Rarely I've suspected papers of having serious issues in the algorithm or evaluation. 

2. Don't try and get a “math free” understanding. This is a particular problem for beginners – especially those weak in math. You can’t understand what you are implementing without the math. You can often implement it without a full understanding, and that’s fine – you can't master all of everything. But don't try to get by with none of it. 

If you really want to implement ML algorithms but feel too lacking in the math department, just keep iterating. Try your best, learn, and be frustrated. Eventually it will get a little easier. I personally feel very weak in my math skills, but reading back the same papers I read 3/4 years ago and I'm amazed at how much more I understand and can comprehend. 

3. Don't start with other people’s source code! I'll let you read my reddit comment for the whole spiel. But the short of it is that most people don't actually learn anything by reading the source code if they don't understand what the algorithm is in the first place. 

4. A common question I see is some variant of “how accurately could ML predict X?” or “how would I apply ML to doing X?” While it may seem reasonable to someone new to the field, these are very poor questions to ask. 

In some instances, if the problem is very similar to one that has been solved before, you could point to those previous results as a guess. With more experience, you might be able to hazard a good guess if you know some details about the problem itself. But there are a huge number of variables about the problem that someone isn't going to know about your data, or domain expertise on the problem they simply don't have. 

Even if they did have this information, there is application context that is missing. Does the solution need to run in real time? Are the resources for training the model limited? Does the model have to run on an embedded device? Is ‘accuracy’ the real goal, or is it something else? If accuracy is the goal, what level of accuracy is needed? Are reasonable probabilities needed? Will the model be updated over time? Creating and applying a Machine Learning solution has to take in a lot of factors. 

Instead, as you are learning, go look for already existing ML solutions and read about how they were done or how they work. Once you've learned and seen a number of solutions and the work involved, you can hopefully apply some of those thoughts and ideas to your own problems. 

5. Using the default random number generator. This may seem like an odd one, but it’s a bit nuanced. A number of machine learning algorithms rely on randomness. So you use the PRNG built into your language of choice and everything is great. Except sometimes not, but that’s how randomness works, right? But a number of languages and libraries (especially Java) have very poor quality PRNGs as the default. While there is no need for cryptographically secure PRNGs, switching to a decent quality generator can save a lot of headache debugging behavior that is very difficult to diagnose as coming from the PRNG. I've only had this really impact me a few times, but switching away from the default has been a big help. This is particularly important if you plan on using an algorithm that revolves around randomness. 

6. Using Python or Ruby or some other interpreted language. This is some of my own personal bias here, but really implementing an algorithm efficiently isn't a fun exercise in these languages. You need a faster JITed or fully AOT compiled language so you can figure out where your code is really slow. There are ways around this in Python & friends, but at that point you learning more about implementing in Python, rather than the algorithms themselves

What resources would you suggest?

This is a hard question to answer, because it depends a lot on what your interests are and your background is. A lot of people get into ML from other fields because its useful and has a lot of interesting parts. Linguists are involved a lot in the Natural Language Processing side, I know of a number of physicist and math guys who like some of the more applied and theoretical parts. Topic modeling is evidently growing as a tool for History and English departments to analyze large swaths of text that would be too time consuming to do as an individual. So I'm going to instead list some resources and people I like. 

In my experience, replicating results is often the best way to get good at implementing algorithms. So what I'm considering a 'good' resource for getting better at implementing ML algorithms is a a combination of 3 items

  • Pseudocode description of algorithm. Some papers scatter the details of the algorithm out, making them a bit hard to follow 
  • Explanation of math details for the algorithm. Pseudocode alone isn't always enough, and can be very vague. While you should hopefully get to a point where a derivative or the form of a function could be done yourself, to start its a detail you probably want made explicit. 
  • Reproducible results. Many papers gloss over important details, like parameters tested and how parameter selection is done. But without that information you can't re-create the results to confirm your code works! Even better is when code is provided so you can compare results (not just copy their code). 


Online Course:

You've probably found it or head of it by now. But Andrews Ng's Machine Learning course on Coursera is quite good for raw beginners. He also takes you through an example of decomposing a problem (OCR) as a series of steps that can be solved with ML algorithms. This is most likely the place to start if you are a complete beginner. 

Books:


This is a very common book for people to learn and get, especially since it is free. In my opinion, the book isn't particularly great in any area – but its not particularly bad at anything either.  Though if you are more of a visual learner this book has lots of graphs and diagrams to try and give intuitions of how things work / what's going on. If your goal is to implement for learning, recreating their visual results would be an excellent exercise. 


This is currently my favorite ML book. Murphy does an excellent job explain the algorithms, and relating them to each other to help foster a deeper understanding. Though the learning curve is a bit wonky at times, and the later chapters don’t have quite as much lower level details – its overall excellent. For implementing in particular, much more refined (and explained) pseudo code is present for many of the algorithms and for many of the chapters and algorithms goes through the math needed to develop these algorithms. Occasionally some implementation considerations are discussed. While not often, it is more than most books on Machine Learning.  

People in General: 


Chih-Jen's group consistently puts out very high quality work, including the widely used LIBSVM and LIBLINEAR packages. Their papers are especially good if you are looking for algorithms to implement. He always describes the algorithm in pseudo code with a good amount of detail, including implementation details and choices made with the implementation in mind. They are also meticulous at making their work reproducible, and providing very detailed experimental procedure so you can replicate it with their code or your own. 


Almost every paper I've read that has had Nathan as an author has also provided good pesudo code to follow and implement from. 


Not a huge number of papers, but I've generally found his very concisely described if you are just interested in implementing it. 


Perhaps too ambitious for beginners, Pedro has a lot of very interesting papers that are well described. Not always, but he also provides some code implementations that you can compare against. 

Some papers in particular: 

Bob Carpenter's Lazy Logistic Regression is a good paper to learn from, and goes through the details and derivation of a logistic regression algorithm. He also has a good blog. 

Greg Hamerly's Making k-means even faster is a great algorithm in general (every library's k-means implementation should switch from lloyds to this as the default). But the paper itself is also unique in covering a lot of the implementation tricks used to make it fast (an unfortunate rarity). 

Lawrence Cayton's Accelerating Nearest Neighbor Search on Manycore Systems is an algorithm for accelerating Nearest Neighbor queries. This one doesn't have any pseudo code, but the paper is detailed and also talks about implementation considerations. Its also a good exercise when you get more comfortable to extend the algorithm from just nearest neighbor queries to range queries. 

Hal Daume's Notes on CG and LM-BFGS Optimization of Logistic Regression is probably too much for a first attempt. But its a good starting place for your first really complicated algorithm, as he has assembled all the parts together for you. You'll still be missing a lot of the theory and math, as the paper alone isn't sufficient background for what's being implemented. But its a very practical goal oriented discussion and detailing. Implementing it will give you a good appreciation for the amount of work in writing a debugging more significant code.