Monday, May 29, 2017

New in JSAT 0.0.8!

Wow, things have been busy. My health is improving and there are lots of other fun and exciting things going on, including a new release of JSAT! I just pushed 0.0.8 to Maven; here are some of the new changes I'm excited about:

The Convex Polytope Machine (CPM) is a new classifier I've added, very similar in spirit to the AMM classifier. The CPM boasts a resilience to overfitting, but is binary only. I wasn't quite able to get results as good as the CPM paper's, but my AMM implementation also performs significantly better than theirs - so I'm not going to read too much into it. The CPM should be a good, relatively fast, and fairly accurate classifier for general use on any problem, using a One-vs-All extension for multi-class problems.
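If you want to try it, usage looks like any other binary classifier in JSAT wrapped for multi-class use. This is just a rough sketch from memory - the CPM class name, its package, and the no-arg constructor may not be exact, and "train.libsvm" is a placeholder path - so check the javadocs:

    import java.io.File;
    import jsat.classifiers.CategoricalResults;
    import jsat.classifiers.ClassificationDataSet;
    import jsat.classifiers.Classifier;
    import jsat.classifiers.OneVSAll;
    import jsat.classifiers.svm.extended.CPM; // class/package name from memory
    import jsat.io.LIBSVMLoader;

    public class CpmExample
    {
        public static void main(String[] args) throws java.io.IOException
        {
            // placeholder path to a multi-class dataset in LIBSVM format
            ClassificationDataSet train = LIBSVMLoader.loadC(new File("train.libsvm"));
            // the CPM is binary only, so wrap it in OneVSAll to handle more than two classes
            Classifier cpm = new OneVSAll(new CPM()); // assuming a no-arg constructor with sane defaults
            cpm.trainC(train);
            CategoricalResults result = cpm.classify(train.getDataPoint(0));
            System.out.println("Predicted class: " + result.mostLikely());
        }
    }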

I've also added some improved bounds to Hamerly's K-Means algorithm. I only did what's equivalent to "HamerlyB" in the paper, as the third improvement seemed like it was going to be a little hairy from a code perspective. I might go back and revisit it. Interestingly, these improvements had no impact on the standard MNIST dataset, which I was using to test the code, so I was very confused at first. But on other datasets I was able to see an improvement in both runtime and distance computations.
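Nothing changes from a usage perspective - something like the sketch below should pick up the improvements automatically. The no-arg constructor and its defaults are from memory, and the random data is just a stand-in for a real dataset:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Random;
    import jsat.SimpleDataSet;
    import jsat.classifiers.DataPoint;
    import jsat.clustering.kmeans.HamerlyKMeans;
    import jsat.linear.DenseVector;

    public class HamerlyExample
    {
        public static void main(String[] args)
        {
            // build a small random dataset as a stand-in for real data
            Random rand = new Random(42);
            List<DataPoint> points = new ArrayList<>();
            for (int i = 0; i < 1000; i++)
            {
                double[] vals = new double[20];
                for (int j = 0; j < vals.length; j++)
                    vals[j] = rand.nextGaussian();
                points.add(new DataPoint(new DenseVector(vals)));
            }
            SimpleDataSet data = new SimpleDataSet(points);

            HamerlyKMeans kMeans = new HamerlyKMeans(); // assuming the no-arg constructor with default metric and seeding
            List<List<DataPoint>> clusters = kMeans.cluster(data, 10); // ask for 10 clusters
            System.out.println("Found " + clusters.size() + " clusters");
        }
    }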

I've generally improved a lot of the vector collections in JSAT and added a new one - definitely check the release notes for the full list! A big one was going back and re-implementing the KDTree algorithm. I realized people were getting a bad impression of JSAT by comparing the KDTree implementation (written early on when I was dumb) to existing ones. It just wasn't a good implementation. On MNIST the new code is a bit over 300 times faster, and much more respectable.
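For context, the kind of use case I benchmarked is plain nearest neighbor search. The factory-style API below is how I remember the vector collection interfaces looking, so treat it as a sketch and check the javadocs for the exact signatures:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Random;
    import jsat.linear.DenseVector;
    import jsat.linear.Vec;
    import jsat.linear.VecPaired;
    import jsat.linear.distancemetrics.EuclideanDistance;
    import jsat.linear.vectorcollection.KDTree;
    import jsat.linear.vectorcollection.VectorCollection;

    public class KdTreeExample
    {
        public static void main(String[] args)
        {
            // random vectors as a stand-in for real data
            Random rand = new Random(42);
            List<Vec> vectors = new ArrayList<>();
            for (int i = 0; i < 10000; i++)
            {
                double[] vals = new double[10];
                for (int j = 0; j < vals.length; j++)
                    vals[j] = rand.nextDouble();
                vectors.add(new DenseVector(vals));
            }

            // build the KD-tree through its factory and query the 5 nearest neighbors of a point
            VectorCollection<Vec> tree = new KDTree.KDTreeFactory<Vec>()
                    .getVectorCollection(vectors, new EuclideanDistance());
            List<? extends VecPaired<Vec, Double>> nearest = tree.search(vectors.get(0), 5);
            System.out.println("Distance to the nearest neighbor: " + nearest.get(0).getPair());
        }
    }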

Some highlights from previous releases that I haven't written about yet: 

From 0.0.7, I want to make note of the new DC-SVM algorithm for training a kernel SVM! This new code is fast and scales to much bigger problems than the previous code ever could. Expect a 4x-30x+ speedup if you are switching from the Platt SMO version in JSAT. I've used this new code to train an SVM on a dataset with 400,000 training points in a couple of hours, where the old SMO version would have taken weeks. My code does make a few departures from the paper, but there were details missing and some numbers that didn't add up for me. The biggest difference is the SVM solver, as the standard Platt SMO won't work due to its bias term. Either way, I've gotten some great results.
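Trying it out is the same as any other classifier in JSAT. A rough sketch, where the DCSVM class name and the kernel-taking constructor are from memory, and the RBF bandwidth and file path are placeholders:

    import java.io.File;
    import jsat.classifiers.ClassificationDataSet;
    import jsat.classifiers.svm.DCSVM; // class name from memory
    import jsat.distributions.kernels.RBFKernel;
    import jsat.io.LIBSVMLoader;

    public class DcSvmExample
    {
        public static void main(String[] args) throws java.io.IOException
        {
            // placeholder path to a binary classification dataset in LIBSVM format
            ClassificationDataSet train = LIBSVMLoader.loadC(new File("train.libsvm"));
            // assuming a constructor that takes the kernel; 0.5 is just a placeholder bandwidth
            DCSVM svm = new DCSVM(new RBFKernel(0.5));
            svm.trainC(train);
            System.out.println("Predicted class for the first training point: "
                    + svm.classify(train.getDataPoint(0)).mostLikely());
        }
    }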

From 0.0.6, I implemented the HDBSCAN algorithm. HDBSCAN has been getting more popular over time, and deservedly so. It simplifies a lot of the problems with DBSCAN, provides a fairly intuitive/interpretable parameter to tune (that doesn't need a whole lot of tuning), and delivers some great clusters. My implementation is O(n²) due to the calculation of the Minimum Spanning Tree of the dataset. This could be reduced to a best case of O(n log n) using some nifty algorithms, but that code will integrate more easily once I've made the jump to Java 8. I might implement an approximate MST algorithm in the interim so that the code can be used on larger datasets.
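Usage is about as simple as clustering gets, since you don't have to pick the number of clusters up front. A sketch from memory - I believe the constructor argument is the min-points parameter, but check the javadocs, and the file path is a placeholder:

    import java.io.File;
    import java.util.List;
    import jsat.classifiers.ClassificationDataSet;
    import jsat.classifiers.DataPoint;
    import jsat.clustering.HDBSCAN;
    import jsat.io.LIBSVMLoader;

    public class HdbscanExample
    {
        public static void main(String[] args) throws java.io.IOException
        {
            // placeholder path; any numeric dataset works, the labels are ignored when clustering
            ClassificationDataSet data = LIBSVMLoader.loadC(new File("data.libsvm"));
            HDBSCAN hdbscan = new HDBSCAN(15); // assuming the int argument is the min-points parameter
            List<List<DataPoint>> clusters = hdbscan.cluster(data);
            System.out.println("HDBSCAN found " + clusters.size() + " clusters");
        }
    }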

0.0.6 also included the LargeViz algorithm. Think of it as t-SNE, but giving consistently good results rather than sometimes needing a bit of fiddling. Try it out!


That's all for now! I'm hoping to make a post about some of my more recent research that's about to be published (done using JSAT!), and will see how interested people are in hearing about that. Now that things aren't quite as crazy I can hopefully start blogging a bit more.

Wednesday, January 4, 2017

A New Year and next JSAT version

Happy new year everyone! I'm recovering from surgery nicely, and hoping to start the new year with a more upward trend on matters related to my health. This has really slowed down a lot of my plans for JSAT, but I hope it will still be useful to everyone. I'm going to prepare a release of the current changes, which may be of interest to some. I added two new SVM classes. One, called SVMnoBias, works with any normalized kernel (which the RBF kernel and similar friends are), and so covers most of the kernels people are interested in. This new SVM class is faster than the PlattSMO one, and is used to implement the Divide and Conquer SVM solver, which is supposed to be very fast despite being an approximation when early stopping is used (which is the default). I haven't tested this on any super large datasets yet, but initial results looked promising. The Kernel K-Means also got improvements in runtime.
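As a rough sketch of what the new class looks like in use - the kernel-taking constructor is from memory, the RBF bandwidth is a placeholder, and "train.libsvm" is just a stand-in path:

    import java.io.File;
    import jsat.classifiers.ClassificationDataSet;
    import jsat.classifiers.svm.SVMnoBias;
    import jsat.distributions.kernels.RBFKernel;
    import jsat.io.LIBSVMLoader;

    public class SvmNoBiasExample
    {
        public static void main(String[] args) throws java.io.IOException
        {
            // placeholder path to a binary classification dataset in LIBSVM format
            ClassificationDataSet train = LIBSVMLoader.loadC(new File("train.libsvm"));
            // SVMnoBias needs a normalized kernel, which the RBF kernel is; 0.5 is a placeholder bandwidth
            SVMnoBias svm = new SVMnoBias(new RBFKernel(0.5)); // assuming a kernel-taking constructor
            svm.trainC(train);
            System.out.println("Predicted: " + svm.classify(train.getDataPoint(0)).mostLikely());
        }
    }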

Some of the stuff I'm planning to work on in the near future includes a follow-up to the AMM, called the Convex Polytope Machine. I'm also hoping to round out some more of the kernel methods, as a few recent papers have proposed some interesting-looking approximate feature space algorithms for the RBF kernel that would be a good improvement over the Random Fourier Features currently implemented. I've also received a few requests about Neural Networks in JSAT. While I think there are a lot of much better Neural Network tools out there (especially when it comes to using GPUs), I'm going to go back and add Batch Normalization as a layer option. I've thought about adding convolutions as well, but I think there are too many great tools for that use case. Anything I implement is going to be considerably slower, especially without a GPU, so I'd rather people use other, better tools instead.

Longer term I'm thinking of putting some of the improvements discussed in the XGBoost paper into JSAT.