My problem, is that most of Weka is inexcusably slow - and doesn't provide enough algorithms for you to really chose the right tool for your job. The code base is massive for the few algorithms it does support, and its bloated with code duplication and inefficient design. The biggest time suck in ML is usually model building / testing, and so I've put together a small benchmark on MNIST.
MNIST, as a data set, is often overused. However it its just large enough that if your code is inefficient you will feel it. We also know very well what accuracies are obtainable with various models on MNIST, making it a good sanity check. So I downloaded an ARFF version of the data from here and made this benchmark (hosted here). The purpose of this benchmark is to show that JSAT has significantly faster implementations of many of the same algorithms, and has better algorithm to use as well.
For this benchmark I've tried to keep everything apples-to-apples, and did my best to make sure each algorithm was doing the same thing and had the same options set. Most of the cases I made JSAT use the parameters Weka uses by default when possible. For some algorithms (like SVMs) that need a good set of parameters, I used values from a grid search I did before with LIBSVM.
All the times below are presented in seconds, run on my iMac (2.66 Ghz Core i5, 16GB of RAM) with nothing else. I've omitted the train/test time for algorithms where training/testing is not a performance issue.
Algorithm | Weka Time | Weka Error | JSAT Time | JSAT Error | JSAT Speedup |
---|---|---|---|---|---|
SVM w/ RBF Kernel (Full Cache) | Train: 7713.549 Test: 1339.633 |
0.0155 | Train: 3661.7 Test: 337.765 |
0.0156 | Train: 2.1x Test: 4.0x |
SVM w/ RBF Kernel (No Cache) | Train: 5657.924 Test: 1336.557 |
0.0155 | Train: 2558.846 Test: 317.663 |
0.0156 | Train: 2.2x Test: 4.2x |
RBF SVM stochastic w/ 5 iterations | Train: 518.654 Test: 10.443 |
0.0241 | Train: 10.9x Test: 128.0x (Over SVM) |
||
RBF SVM RKS features w/ Linear Solver | Train: 68.398 Test: 0.571 |
0.0293 | Train: 82.7x Test: 2340x (Over SVM) |
||
C4.5 Decision Tree | Train: 303.373 | 0.1134 | Train: 117.785 | 0.1146 | Train: 2.6x |
Random Forest w/ 50 trees | Tain: 143.127 | 0.0326 | Train: 100.673 | 0.0453 | Train: 1.4x |
1-NN (brute force) | Test: 2537.483 | 0.0309 | Test: 648.71 | 0.0309 | Test: 3.9x |
1-NN (Ball Tree) | Train: 52.263 Test: 3269.183 |
0.0309 | |||
1-NN (Cover Tree) | Train: 538.132 Test: 2245.709 |
0.0309 | |||
1-NN (VPmv) | Train: 1.909 Test: 493.69 |
0.0309 | |||
1-NN (Random Ball Cover) | Train: 13.448 Test: 576.737 |
0.0309 | |||
Logistic Regression by LBFGS, λ = 1e-4 | Train: 3301.899 | 0.0821 | Train: 907.259 | 0.0776 | Train: 3.6x |
Log Regression stochastic w/ 10 iterations | Train: 10.545 | 0.0840 | Train: 313x (over LBFGS) |
||
Logistic Regression OneVsAll DCD | Train: 276.865 | 0.080 | Train: 12.3x (over LBFGS) |
||
k-means (lloyd's algorithm) |
1010.6009 | 41.1913 | 24.5x | ||
k-means (Hamerly's algorithm) |
10.5358 | 95.9x | |||
k-means (Elkan's algorithm) |
5.1768 | 195x |
When running the same algorithms, most results end up with about the exact same accuracy. For the Random Forest JSAT does a little worse for some reason, and I can't find a setting in Weka to change. For Logistic Regression JSAT does a little better. However, in the case of Random Forest, you could trade some of your speed savings for a few more trees to make up the difference.
For SVMs, the speed advantage is about 2x for training and 4x for prediction. Speed advantages for the SVM are particularly important since the SVM is very sensitive to parameters, so a grid search is going to be needed, multiplying the runtime by a factor of 10x-100x depending on how many parameter combinations you want to test. The alternative algorithms in JSAT (Stochastic Kernel training directly & approximate feature space) up the advantage even further.
For the tree based, JSAT's advantage isn't as big. But if you are going to use trees in an ensemble (As is often the case), the 2.6x speedup is going to add up.
For Nearest Neighbor algorithms, both Weka and JSAT have data structures for accelerating nearest neighbor queries. While both of the ones from JSAT improved time and were fast to train, the Ball Tree algorithm in Weka was slower than the naive approach - and the Cover tree was only a little faster, but took almost as much time to train as JSAT did to do the whole problem to begin with!
For Logistic Regression, JSAT is again single digits faster, 3.6x. While LBFGS is a great and versatile tool to have, its not always the best choice especially for a bread-and-butter algorithm like Logistic Regression. JSAT's exact alternative exact solver (DCD) is the same algorithm used in LIBLINEAR, and is over 12x faster. However you don't always need an exact solution, and the SGD based solver in JSAT gets a solution almost as good as Weka's LBFGS and gets it 313 times faster.
Finally, for k-means, JSAT is already 24 times faster using the exact same algorithm. But JSAT also has two additional algorithms that obtain the exact same solution, but avoid redundant work - resulting in 95x to almost 200x faster training. With k-means being such a common tool both in use and as a building block for other algorithms, the difference is huge.
So, overall - JSAT has faster implementations of the same algorithms, and more algorithms - giving you the flexibility to pick the right tool for the job.
Just to be clear, I'm not claiming that JSAT has the fastest implementations ever of these algorithms. Many great tools, such as scikit-learn, have faster implementations for some algorithms. But JSAT does have a greater variety and if you are going to be working in Java, you should definitely consider JSAT over Weka just on the issue of speed.