Machine Learning with JSAT: JSAT vs Weka on MNIST

I'm a strong proponent of using the right tool for the job, and I need to prefix this post with that fact that Java is not the best tool for Machine Learning in general. But you may know java best, you may need to interface with a Java infrastructure, or you might just know / prefer Java over other languages. For any number of reasons you may need or want to do some ML work in Java, and Weka is the choice that most people hear first.

My problem, is that most of Weka is inexcusably slow - and doesn't provide enough algorithms for you to really chose the right tool for your job. The code base is massive for the few algorithms it does support, and its bloated with code duplication and inefficient design. The biggest time suck in ML is usually model building / testing, and so I've put together a small benchmark on MNIST.

MNIST, as a data set, is often overused. However it its just large enough that if your code is inefficient you will feel it. We also know very well what accuracies are obtainable with various models on MNIST, making it a good sanity check. So I downloaded an ARFF version of the data from here and made this benchmark (hosted here). The purpose of this benchmark is to show that JSAT has significantly faster implementations of many of the same algorithms, and has better algorithm to use as well.

For this benchmark I've tried to keep everything apples-to-apples, and did my best to make sure each algorithm was doing the same thing and had the same options set. Most of the cases I made JSAT use the parameters Weka uses by default when possible. For some algorithms (like SVMs) that need a good set of parameters, I used values from a grid search I did before with LIBSVM.

All the times below are presented in seconds, run on my iMac (2.66 Ghz Core i5, 16GB of RAM) with nothing else. I've omitted the train/test time for algorithms where training/testing is not a performance issue.

Algorithm	Weka Time	Weka Error	JSAT Time	JSAT Error	JSAT Speedup
SVM w/ RBF Kernel (Full Cache)	Train: 7713.549 Test: 1339.633	0.0155	Train: 3661.7 Test: 337.765	0.0156	Train: 2.1x Test: 4.0x
SVM w/ RBF Kernel (No Cache)	Train: 5657.924 Test: 1336.557	0.0155	Train: 2558.846 Test: 317.663	0.0156	Train: 2.2x Test: 4.2x
RBF SVM stochastic w/ 5 iterations			Train: 518.654 Test: 10.443	0.0241	Train: 10.9x Test: 128.0x (Over SVM)
RBF SVM RKS features w/ Linear Solver			Train: 68.398 Test: 0.571	0.0293	Train: 82.7x Test: 2340x (Over SVM)
C4.5 Decision Tree	Train: 303.373	0.1134	Train: 117.785	0.1146	Train: 2.6x
Random Forest w/ 50 trees	Tain: 143.127	0.0326	Train: 100.673	0.0453	Train: 1.4x
1-NN (brute force)	Test: 2537.483	0.0309	Test: 648.71	0.0309	Test: 3.9x
1-NN (Ball Tree)	Train: 52.263 Test: 3269.183	0.0309
1-NN (Cover Tree)	Train: 538.132 Test: 2245.709	0.0309
1-NN (VPmv)			Train: 1.909 Test: 493.69	0.0309
1-NN (Random Ball Cover)			Train: 13.448 Test: 576.737	0.0309
Logistic Regression by LBFGS, λ = 1e-4	Train: 3301.899	0.0821	Train: 907.259	0.0776	Train: 3.6x
Log Regression stochastic w/ 10 iterations			Train: 10.545	0.0840	Train: 313x (over LBFGS)
Logistic Regression OneVsAll DCD			Train: 276.865	0.080	Train: 12.3x (over LBFGS)
k-means (lloyd's algorithm)	1010.6009		41.1913		24.5x
k-means (Hamerly's algorithm)			10.5358		95.9x
k-means (Elkan's algorithm)			5.1768		195x

When running the same algorithms, most results end up with about the exact same accuracy. For the Random Forest JSAT does a little worse for some reason, and I can't find a setting in Weka to change. For Logistic Regression JSAT does a little better. However, in the case of Random Forest, you could trade some of your speed savings for a few more trees to make up the difference.

For SVMs, the speed advantage is about 2x for training and 4x for prediction. Speed advantages for the SVM are particularly important since the SVM is very sensitive to parameters, so a grid search is going to be needed, multiplying the runtime by a factor of 10x-100x depending on how many parameter combinations you want to test. The alternative algorithms in JSAT (Stochastic Kernel training directly & approximate feature space) up the advantage even further.

For the tree based, JSAT's advantage isn't as big. But if you are going to use trees in an ensemble (As is often the case), the 2.6x speedup is going to add up.

For Nearest Neighbor algorithms, both Weka and JSAT have data structures for accelerating nearest neighbor queries. While both of the ones from JSAT improved time and were fast to train, the Ball Tree algorithm in Weka was slower than the naive approach - and the Cover tree was only a little faster, but took almost as much time to train as JSAT did to do the whole problem to begin with!

For Logistic Regression, JSAT is again single digits faster, 3.6x. While LBFGS is a great and versatile tool to have, its not always the best choice especially for a bread-and-butter algorithm like Logistic Regression. JSAT's exact alternative exact solver (DCD) is the same algorithm used in LIBLINEAR, and is over 12x faster. However you don't always need an exact solution, and the SGD based solver in JSAT gets a solution almost as good as Weka's LBFGS and gets it 313 times faster.

Finally, for k-means, JSAT is already 24 times faster using the exact same algorithm. But JSAT also has two additional algorithms that obtain the exact same solution, but avoid redundant work - resulting in 95x to almost 200x faster training. With k-means being such a common tool both in use and as a building block for other algorithms, the difference is huge.

So, overall - JSAT has faster implementations of the same algorithms, and more algorithms - giving you the flexibility to pick the right tool for the job.

Just to be clear, I'm not claiming that JSAT has the fastest implementations ever of these algorithms. Many great tools, such as scikit-learn, have faster implementations for some algorithms. But JSAT does have a greater variety and if you are going to be working in Java, you should definitely consider JSAT over Weka just on the issue of speed.

Machine Learning with JSAT

Sunday, March 29, 2015

JSAT vs Weka on MNIST

No comments:

Post a Comment

About Me