The Convex Polytope Machine is a new classifier I've added, which is very similar in spirit to the AMM classifier. The CPM boasts a resilience to overfitting, but is binary only. I wasn't quite able to get results as good as the CPM paper, but my AMM implementation also performs significantly better than theirs - so I'm not going to read too into it. The CPM should be a good, relatively fast, and fairly accurate classifier for general use on any problem that will work well with a OneVsAll extension to multi-class problems.
I've also added some improved bounds to Hamerly's K-Means algorithm. I only did what's equivalent to "HamerlyB" in the paper, as the 3rd improvement seemed like it was going to be a little hairy from a code perspective. I might go back and revisit. Interestingly, these improvements had no impact for the standard MNIST dataset, which I was using to test the code, so I was very confused at first. But on other datasets I was able to see an improvement in both runtime and distance computations.
I've generally improved a lot of the vector collections in JSAT and added a new one, definitely check the release notes for the list! A big one was going back and re-implementing the KDTree algorithm. I realized people we getting a bad impression of JSAT by comparing the KDTree implementation (written early on when I was dumb) to existing ones. It just wasn't a good implementation. On MNIST the new code is a bit over 300 times faster, and much more respectable.
Some highlights from previous releases that I haven't written about yet:
From 0.0.7, I want to make note of the new DC-SVM algorithm for training a kernel SVM! This new code is fast, and scales to much bigger problems than the previous code ever could. Expect a 4x-30x+ speedup if you are switching from the Platt SMO version in JSAT. I've used this new code to train an SVM on a dataset with 400,000 training points in a couple of hours, where the old SMO version would have taken weeks. My code does make a few departures from the paper, but there were details missing and some numbers that don't add up to me. The biggest difference is the SVM solver, as the standard Platt SMO won't work due to its bias term. Either way, I've gotten some great results.
From 0.0.6, I implemented the HDBSCAN algorithm. HDBSCAN has been getting more popular over time, and deservedly so. It simplifies a lot of the problems with DBSCAN, provides a fairly intuitive/interpretable parameter to tune (that doesn't need a whole lot of tuning), and delivers some great clusters. My implementation is O(n2) due to the calculation of the Minimum Spanning Tree of the dataset. This could be reduced to a best case O(n log n) using some nifty algorithms, but their code would integrate more easily once I've done the Java 8 jump. I might implement an approximate MST algorithm in the interim so that the code can be used on larger datasets.
0.0.6 also included the LargeViz algorithm. Think of it as t-SNE, but consistently good results rather than needing a bit of fiddling sometimes. Try it out!
Thats all for now! I'm hoping to make a post about some of my more recent research that's about to be published (done using JSAT!), and will see how interested people are in hearing about that. Now that things aren't quite as crazy I can hopefully start blogging a bit more.
Been playing around with JSAT for a while now. Recently used it in a production setting with great success. Really impressive. Looking forward to read about your research in the upcoming posts. Cheers!
ReplyDeleteGlad to hear it! If you have time to send an email I would love to get any details on how/what parts you are using, and what could be improved.
Delete