Sunday, June 5, 2016

New in JSAT 0.0.4 and 0.0.5

Ah, I've gotten bad about making posts. I've recently started a PhD program and that has eaten up a lot of my time. I'm still using JSAT for stuff, and a lot of what I'll be adding to JSAT will be stuff I plan on using for my research.

With that, I also wanted to give an update on what's new in the last two releases of JSAT. In 0.0.4 I added some support for missing values in data. The ARFF file format supports this, using "?" to indicate a missing value. The loader used to throw these away, but now they are included in the data. For numeric features, JSAT will indicate a missing value with NaN. For categorical features, a negative value is used. I added this due to a charity project we were working on at my job that had a lot of missing values. Missing data is a common problem we like to ignore, but decision trees can learn from (and predict on) data with the missing values left in place, so I added that support. For all other algorithms, you'll probably want to use the Imputer data transform.
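To make the convention concrete, here is a small plain-Java sketch (the helper names are hypothetical illustrations, not JSAT's actual API): NaN marks a missing numeric feature, and a negative index marks a missing categorical one.

```java
// Sketch of the missing-value convention described above (hypothetical
// helpers, not JSAT's internals): NaN for missing numeric features,
// a negative index for missing categorical features.
public class MissingValues {
    static boolean isMissingNumeric(double v) {
        return Double.isNaN(v); // a "?" in the ARFF file becomes NaN
    }

    static boolean isMissingCategorical(int category) {
        return category < 0; // valid categories are 0, 1, 2, ...
    }

    public static void main(String[] args) {
        double[] numeric = {1.5, Double.NaN, 3.0};
        int[] categorical = {0, 2, -1};

        int missing = 0;
        for (double v : numeric)
            if (isMissingNumeric(v)) missing++;
        for (int c : categorical)
            if (isMissingCategorical(c)) missing++;

        System.out.println(missing + " missing values"); // prints "2 missing values"
    }
}
```

Incidentally, NaN's contagious arithmetic is part of why everything else wants the Imputer first: any distance or dot product that touches a NaN comes out NaN.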

I also finally got around to improving the parallelism when training decision stumps/trees. It worked well for MNIST on my computer, but I haven't thoroughly performance tested it, so let me know how it works on your problems with varied datasets! I also improved the parallel performance of the kernelized k-means objects.

An actual new addition is a much better hierarchical clustering algorithm based on nearest neighbor chains (called NNChainHAC in JSAT). This lets you run many agglomerative algorithms in O(n²) time (which is optimal) and O(n) memory. I'm not actually 100% sure I got the O(n) memory part working perfectly, as it takes more compute time than I can handle to keep testing larger and larger problems. Part of the problem is that, after the initial distance computations, most of the work is in data access and merging to create the dendrogram. To minimize memory use, you would keep a sparse hash map from the remaining indices to their merged scores. But the amount of work per step is so minimal that computing the hash becomes a huge portion of it, and using a dense array ends up being at least 2x faster.

Finally, one of the items I've meant to add for a while is feature importance from decision trees. Three algorithms are implemented, and they integrate easily with Random Forest. The out-of-bag magic of RFs is combined with the feature importance to get a ranking of features that is really useful and takes non-linear relationships into account. A pretty rare and useful property!
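To make the nearest-neighbor chain idea concrete, here is a simplified plain-Java sketch (not JSAT's NNChainHAC code). It uses single linkage, which is a reducible linkage, so the chain stays valid across merges; and it keeps a full distance matrix for clarity, so it pays O(n²) memory instead of the O(n) goal discussed above.

```java
// Simplified sketch of the nearest-neighbor chain idea behind NNChainHAC
// (not JSAT's implementation): follow nearest neighbors until two
// clusters are each other's nearest neighbor, then merge that pair.
// Single linkage is applied via the Lance-Williams min rule.
public class NNChainSketch {
    /** Returns the n-1 merges as pairs of cluster indices, in order. */
    static int[][] cluster(double[][] dist) {
        int n = dist.length;
        double[][] d = new double[n][];
        for (int i = 0; i < n; i++)
            d[i] = dist[i].clone(); // distances get updated in place
        boolean[] dead = new boolean[n];
        int[] chain = new int[n]; // a stack of cluster indices
        int len = 0;
        int[][] merges = new int[n - 1][2];
        for (int m = 0; m < n - 1; m++) {
            if (len == 0) // start a new chain from any live cluster
                for (int i = 0; i < n; i++)
                    if (!dead[i]) { chain[len++] = i; break; }
            while (true) {
                int a = chain[len - 1];
                // nearest live neighbor of a, preferring the previous
                // chain element on ties so mutual pairs are detected
                int nn = len > 1 ? chain[len - 2] : -1;
                double best = nn >= 0 ? d[a][nn] : Double.POSITIVE_INFINITY;
                for (int j = 0; j < n; j++)
                    if (j != a && !dead[j] && d[a][j] < best) {
                        best = d[a][j];
                        nn = j;
                    }
                if (len > 1 && nn == chain[len - 2]) {
                    // a and nn are mutual nearest neighbors: merge them
                    int keep = Math.min(a, nn), drop = Math.max(a, nn);
                    merges[m] = new int[]{keep, drop};
                    dead[drop] = true;
                    for (int j = 0; j < n; j++) // single-linkage update
                        if (!dead[j] && j != keep)
                            d[keep][j] = d[j][keep] = Math.min(d[keep][j], d[drop][j]);
                    len -= 2; // pop the merged pair off the chain
                    break;
                }
                chain[len++] = nn; // otherwise extend the chain
            }
        }
        return merges;
    }

    public static void main(String[] args) {
        double[] p = {0, 1, 10, 11}; // four points on a line
        double[][] dist = new double[4][4];
        for (int i = 0; i < 4; i++)
            for (int j = 0; j < 4; j++)
                dist[i][j] = Math.abs(p[i] - p[j]);
        for (int[] merge : cluster(dist)) // the two close pairs merge first
            System.out.println(merge[0] + " merged with " + merge[1]);
    }
}
```

The full matrix is just for readability; the O(n)-memory direction discussed above replaces it with distances recomputed from the data plus a map (or dense array) of merged scores.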

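One classic algorithm in this family is out-of-bag permutation importance; this sketch is an illustration of the idea and may not match any of the three JSAT implements exactly. For each feature, shuffle that column among the out-of-bag rows and measure how much the tree's accuracy drops.

```java
import java.util.Random;
import java.util.function.Function;

// Sketch of out-of-bag permutation importance (an illustration, not
// necessarily one of JSAT's three algorithms): shuffling an important
// feature's column hurts accuracy; shuffling an ignored one does not.
public class PermutationImportance {
    /** Accuracy drop per feature when that feature's column is shuffled. */
    static double[] importance(Function<double[], Integer> tree,
                               double[][] oobX, int[] oobY, long seed) {
        int n = oobX.length, dims = oobX[0].length;
        double base = accuracy(tree, oobX, oobY);
        double[] imp = new double[dims];
        Random rng = new Random(seed);
        for (int f = 0; f < dims; f++) {
            double[][] shuffled = new double[n][];
            for (int i = 0; i < n; i++)
                shuffled[i] = oobX[i].clone();
            for (int i = n - 1; i > 0; i--) { // Fisher-Yates on column f only
                int j = rng.nextInt(i + 1);
                double tmp = shuffled[i][f];
                shuffled[i][f] = shuffled[j][f];
                shuffled[j][f] = tmp;
            }
            imp[f] = base - accuracy(tree, shuffled, oobY);
        }
        return imp;
    }

    static double accuracy(Function<double[], Integer> tree,
                           double[][] X, int[] y) {
        int correct = 0;
        for (int i = 0; i < X.length; i++)
            if (tree.apply(X[i]) == y[i])
                correct++;
        return correct / (double) X.length;
    }
}
```

Averaging these drops over every tree in the forest (each with its own out-of-bag rows) gives the forest-level ranking, and because the tree itself is the measuring device, non-linear and interaction effects show up in the score.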
I'm currently working on some more clustering-related stuff for a research project. Another item I attempted recently was the recent LargeViz paper. Unfortunately the gradient I derived does not seem to work, which makes me wish I were better / more confident at math :(. I did implement an initial version of the random projection trees with neighbor exploration, though it needs more testing. I'm also not sure what parameters they used to get such high accuracies in Figure 2. I tested on MNIST and got a significant performance drop using the trees - and it was only 2 times faster (for some parameters) than my minimum variance VP tree. I still want to play with it more though, and I think I could apply the same idea to VP trees to get a 2x speedup. The problem with the RP tree is that the splitting plane is the perpendicular bisector of two random points. So you can either figure out which side a query falls on by doing 2 distance computations at each node, or by storing a dense vector representing the plane explicitly. The former takes twice as much compute time, and the latter takes a huge amount of memory. Making a VP "forest" could get around that, since a VP split only needs 1 random point and a radius.
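To illustrate that last trade-off: deciding the side with two distances is algebraically the same as one dot product against a stored normal, since ||x-a||² < ||x-b||² exactly when 2·x·(b-a) < b·b - a·a. A plain-Java sketch (names hypothetical):

```java
// Two equivalent side tests for an RP-tree split defined by random
// points a and b: compare two distances (no extra memory per node), or
// precompute the normal (b - a) and a threshold (b.b - a.a) and do one
// dot product (half the compute, but a dense stored vector per node).
public class RpSplit {
    static double dot(double[] x, double[] y) {
        double s = 0;
        for (int i = 0; i < x.length; i++)
            s += x[i] * y[i];
        return s;
    }

    static double distSq(double[] x, double[] y) {
        double s = 0;
        for (int i = 0; i < x.length; i++) {
            double diff = x[i] - y[i];
            s += diff * diff;
        }
        return s;
    }

    /** Side test via two distance computations. */
    static boolean leftByDistances(double[] x, double[] a, double[] b) {
        return distSq(x, a) < distSq(x, b);
    }

    /** Same test via the stored normal: 2 x.(b-a) < b.b - a.a */
    static boolean leftByHyperplane(double[] x, double[] normal, double threshold) {
        return 2 * dot(x, normal) < threshold;
    }

    public static void main(String[] args) {
        double[] a = {0, 0}, b = {2, 0};
        double[] normal = {2, 0};                 // b - a
        double threshold = dot(b, b) - dot(a, a); // = 4
        double[] x = {0.5, 1.0};                  // closer to a
        System.out.println(leftByDistances(x, a, b));               // prints true
        System.out.println(leftByHyperplane(x, normal, threshold)); // prints true
    }
}
```

A VP-tree split sidesteps both costs, which is the appeal of the "forest" idea: storing one point and a radius, the side test is a single distance computation with no dense normal to keep around.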