Monday, May 29, 2017

New in JSAT 0.0.8!

Wow stuff has been busy. Health is improving and lots of other fun exciting things going on, including a new release of JSAT! I just pushed 0.0.8 to Maven, some of the new changes that I'm excited about:

The Convex Polytope Machine is a new classifier I've added, which is very similar in spirit to the AMM classifier. The CPM boasts a resilience to overfitting, but is binary only. I wasn't quite able to get results as good as the CPM paper, but my AMM implementation also performs significantly better than theirs - so I'm not going to read too into it. The CPM should be a good, relatively fast, and fairly accurate classifier for general use on any problem that will work well with a OneVsAll extension to multi-class problems.

I've also added some improved bounds to Hamerly's K-Means algorithm. I only did what's equivalent to "HamerlyB" in the paper, as the 3rd improvement seemed like it was going to be a little hairy from a code perspective. I might go back and revisit. Interestingly, these improvements had no impact for the standard MNIST dataset, which I was using to test the code, so I was very confused at first. But on other datasets I was able to see an improvement in both runtime and distance computations. 

I've generally improved a lot of the vector collections in JSAT and added a new one, definitely check the release notes for the list! A big one was going back and re-implementing the KDTree algorithm. I realized people we getting a bad impression of JSAT by comparing the KDTree implementation (written early on when I was dumb) to existing ones. It just wasn't a good implementation. On MNIST the new code is a bit over 300 times faster, and much more respectable. 

Some highlights from previous releases that I haven't written about yet: 

From 0.0.7, I want to make note of the new DC-SVM algorithm for training a kernel SVM! This new code is fast, and scales to much bigger problems than the previous code ever could. Expect a 4x-30x+ speedup if you are switching from the Platt SMO version in JSAT. I've used this new code to train an SVM on a dataset with 400,000 training points in a couple of hours, where the old SMO version would have taken weeks. My code does make a few departures from the paper, but there were details missing and some numbers that don't add up to me. The biggest difference is the SVM solver, as the standard Platt SMO won't work due to its bias term. Either way, I've gotten some great results. 

From 0.0.6, I implemented the HDBSCAN algorithm. HDBSCAN has been getting more popular over time, and deservedly so. It simplifies a lot of the problems with DBSCAN, provides a fairly intuitive/interpretable parameter to tune (that doesn't need a whole lot of tuning), and delivers some great clusters. My implementation is O(n2) due to the calculation of the Minimum Spanning Tree of the dataset. This could be reduced to a best case O(n log n) using some nifty algorithms, but their code would integrate more easily once I've done the Java 8 jump. I might implement an approximate MST algorithm in the interim so that the code can be used on larger datasets. 

0.0.6 also included the LargeViz algorithm. Think of it as t-SNE, but consistently good results rather than needing a bit of fiddling sometimes. Try it out! 

Thats all for now! I'm hoping to make a post about some of my more recent research that's about to be published (done using JSAT!), and will see how interested people are in hearing about that. Now that things aren't quite as crazy I can hopefully start blogging a bit more. 

Wednesday, January 4, 2017

A New Year and next JSAT version

Happy new year everyone! I'm recovering from surgery nicely, and hoping to start the new year with a more upward trend on matters related to my health. This has really slowed down a lot of my plans for JSAT, but I hope it still be useful to everyone. I'm going to prepare a release of the current changes, which may be of interest to some. I added two new SVM classes, one called SVMnoBias, which is useful for most kernels (it has to be normalized, which the RBF kernel and similar friends are) people are interested in. This new SVM class is faster than the PlattSMO one, and used to implement the Divide and Conquer SVM solver, which is supposed to be very fast despite being an approximation when using early stopping (which is the default). I haven't tested this on any super large datasets yet, but initial results looked promising. The Kernel K-Means also got improvements in runtime.

Some of the stuff I'm planning to work on in the near future include a follow up to the AMM, called the Convex Polytope Machine. I'm also hoping to round out some more of the kernel methods, as a few recent papers have proposed some interesting looking approximate feature space algorithms for the RBF kernel that would be a good improvement of the Random Fourier Features currently implemented. I've also received a few requests about Neural Networks in JSAT. While I think there are a lot of much better Neural Network tools out there (especially when it comes to using GPUs), I'm going to go back and add Batch Normalization as a layer option. I've thought about adding convolutions as well, but I think there are too many great tools for that use case. Anything I implement is going to be considerably slower, especially without a GPU, so I'd rather people use other, better, tools instead.

Longer term I'm thinking of putting some of the improvements discussed in the XGBoost paper into JSAT.

Sunday, June 5, 2016

New in JSAT 0.0.4 and 0.0.5

Ah, I've gotten bad about making posts. I've recently started a PhD program and that has eaten up a lot of my time. I'm still using JSAT for stuff, and a lot of what I'll be adding to JSAT will be stuff I plan on using for my research.

With that, I also wanted to give an update on what's new in the last two released of JSAT. In 0.0.4 I added some support for missing values in data. The ARFF file format supports this, and uses "?" to indicate missing values. The loader used to throw these away, but now are included in the data. For numeric features, JSAT will indicate a missing value with NaN. For categorical values, a negative value will be used. I added this due to a charity project we were working on at my job that had a lot of missing values. It is a common problem we like to ignore, but decision trees can learn (and predict) from data with missing values in place. So I added that support. For all other algorithms, you'll probably want to use the Imputer data transform.

I also finally got around to improving the parallelism when training decision stumps/trees. It worked well for MNIST on my computer, but I havent super thoroughly performance tested it. Let me know how it works on your problem with varied datasets! I also improved the parallel performance of the kernelized k-means objects. An actual new addition is a much better hierarchical clustering algorithm based on nearest neighbor chains (called NNChainHAC in JSAT). This allows you to do many algorithms in O(n2) time (which is optimal) and O(n) memory. I'm not actually 100% sure I got the O(n) memory part working perfectly, as it takes more compute time than I can handle to keep testing larger and larger problems. Part of the problem is that, after initial distance computations, most of the work is in data access and merging to create the dendrogram. So to minimize memory use, you should be using a sparse hash map of the indices remaining to their merged scores. But the amount of work is so minimal that computing the hash is a huge portion of the work, and using a dense array ends up being at least 2x faster. Finally, one of the items I've meant to add a while is feature importance from decision trees. There are 3 algorithms implemented and can be integrated easily with Random Forest. The out of bag magic of RFs are used with the feature importance to get a ranking of features that is really useful and takes into account non-linear relationships. A pretty rare and useful property!

I'm currently working on some more cluster related stuff for a research project. Another item I attempted to work on recently was this recent LargeViz paper. Unfortunately the gradient I derived does not seem to work and makes me wish I was better / more confident at math :(. I did implement an initial version of the Random Projection trees with neighbor exploration, though it needs more testing. I'm also not sure what parameters they used to get such high accuracies in Figure 2. I tested on MNIST and got a significant performance drop using the trees - and it was only 2 times faster (for some parameters) than my minimum variance VP tree. I still want to play with it more though, and I think I could apply the same idea to VP trees to get a 2x speedup. The problem with the RP tree is that its the line orthogonal to two random points. So you can either figure out which side you are on by doing 2 distance computations at each node, or by storing a dense vector representing the line explicitly. The former takes up twice as much compute time, and the latter takes up a huge amount of memory. Making a VP "forest" could get around that since you only need 1 random point and a radius.

Friday, November 20, 2015

A Binary Format for JSAT Datasets

I made a post a while ago about improving the LIBSVM file loader in JSAT so that it wouldn't use nearly as much memory and be a good deal faster too, and I complained about everyone using human readable ASCII file formats. Spurred by a recent pull request, I've finally gone ahead and implemented a simple binary format for storing datasets in JSAT. I'm not using Java's serialization for this, so it's a binary format that one could just as easily implement a reader/writer for in other languages as well.

The binary format supports both sparse and dense storage of numeric features, and stores the string names for categorical features. Since floating point values take up the majority of space, it also supports saving them in multiple different methods. Currently it can save values as a 32 or 64 bit float, as a short, or as a signed/unsigned byte. The default method is to scan through the dataset and check which of the options would result in the smallest file without losing any information. Despite this overhead, it's faster than writing either an ARFF or LIBSVM file! You can also explicitly pass the method you want to store it as, which will skip the overhead and do a lossy conversion if necessary.

I did a simple performance case for reading/writing the training set of MNIST. For the JSATData I tested with sparse and dense numeric features (determined by how the data is stored in memory) and using the Auto/ Unsigned byte, and 64 bit float options. 

For this first table, I left the data normalized, so it was stored as integers from 0 to 255, making it easy for it to be saved as bytes. The JSAT writer for ARFF and LIBSVM writes out 0s as "0.0", so technically they have a some unnecessary padding. These numbers are from my Macbook Air which has a nice SSD.

Method ARFF LIBSVM JSATData FP64 (sparse) JSATData U_BYTE (sparse) JSATData AUTO (sparse) JSATData FP64 JSATData U_BYTE JSATData AUTO
Read Time (ms) 7810 3777 989 758 2839 1735
Write Time (ms) 7091 3322 790 586 1652 1894 895 1580
File Size (MB) 203.7 87.5 108.9 45.6 377.1 47.4

In this next table, I normalized the values to a range of [0, 1]. This makes the JSAT code AUTO select FP64, and uses a lot more text in ARFF and LIBSVM. Since U_BYTE won't work any more, I also did a force as FP32.
Method ARFF LIBSVM JSATData FP64 (sparse) JSATData AUTO (sparse) JSATDATA FP32 (sparse) JSATData FP64 JSATData AUTO JSATData FP32
Read Time (ms) 15247 6920 1032 941 2873 2765
Write Time (ms) 9445 5643 732 808 833 1794 1788 1933
File Size (MB) 318.3 202.1 108.9 72.7 377.1 188.7

You'll notice that the differences in write time didn't change so much for the second table between 64 bit and AUTO. This is because the code for auto detecting the best format will quit early once it has eliminated everything more efficient than a 64 bit float. As promised, the 64 bit format doesn't change in file size at all, which is a much more consistent and desirable behavior. And even when JSATData does not result in smaller file sizes, the format is simple making it much more IO bound, so it's much faster than the CPU bound ARFF and LIBSVM which have to do a bunch of string processing and math to convert the strings to floats.

I've also added a small feature for strings stored in the format, since I save out the names of categorical features and their options. There is a simple marker to indicate if strings are ASCII or UTF-16, that way for common ASCII strings not as much data is wasted. The writer will also auto-detect if ASCII is safe or it needs UTF-16.

I've written this format with JSAT's three main dataset types in mind, but hopefully this can be useful for others as well. If there is interest I may write a reader/writer for Python and C/C++ and host them up as small little projects on github. 

Monday, October 26, 2015

Visualization Algorithms in JSAT

For the next update of JSAT (you can try them early in the 0.0.3-SNAPSHOT release), I've been working on implementing a few algorithms meant specifically for visualizing datasets. For now I've chosen three of the more common / well known algorithms, MDS, Isomap, and t-SNE. Unlike PCA, these algorithms are mostly meant to transform only one dataset at once - the intention being to visualize that block of data. You can't necessarily apply these algorithms to new data points, nor would you necessarily want to. Their purpose is really to help explore data and learn more.

A classic test problem is the "Swiss roll" dataset. This is a dataset where some of the data, such as the red and blue points below, are close based on the euclidean distance - but our intuition about the data is that they are really far away from each other in actuality. So a good 2D visualization of this data would place those data points far away from each other while keeping the red/orange close together.

t-SNE doesn't do a great job at visualizing the swiss roll data, but for more general datasets t-SNE is a great algorithm. The implementation in JSAT is the faster Barnes Hut approximation, so it runs in O(n log n) time, making it applicable to larger datasets. Below is a visualization of t-SNE on the MNIST dataset, which makes it easy to see 10 clusters, 1 for each datapoint. Looking at the relationships we can see stuff that makes sense. The 9 and 4 clusters are near each other, which makes sense - the noisy 9s and 4s could easily be misconstrued. We also see that some noisy 9s in the cluster of 7s, another understandable case.

t-SNE was a bit of work to implement, and I had difficulty replicating the results using the exact details in the paper. A few changes worked well for me, and had the benefit of being a bit easier to code. Right now my tSNE implementation is a slower single-threaded then the standard implementation, but mine does support multi-threaded execution.

Hopefully these new tools will be useful for people trying to get a better intuition of what's happening in their data. All three algorithms support multi-threaded execution and implement a new VisualizationTransform interface. The Isomap implementation also includes an extension call C-Isomap, and I'm hoping to add more useful visualization algorithms in the future.

On another note, I've moved the GUI components outside of JSAT completely. Anything GUI related will now be in a new project, JSATFX. As you might guess from the name, I'm going to be trying to do more of the code in JavaFX so I can learn it. The visualizations above are from the library.

Saturday, July 11, 2015

Easier Hyperparameter Tuning

I've just released the newest version of JSAT to my maven repo and github, and the biggest change is some work I've done to make parameter tuning much easier for people who are new to Machine Learning or JSAT specifically.

The crux of this new code, from the user's perspective, is a new method on the GridSearch object autoAddParameters. This method takes a DataSet object, and will automatically add hyper parameters with values to be tested. These values are adjusted to the dataset given, that way reasonable results can be given even if the dataset isn't scaled to a reasonable range.

The goal if this is to help new users and people new to ML. The issue of tuning an algorithm has been the most common feedback I receive from users, and part of the issue is that new users simply don't know what values are reasonable to try for the many algorithms in JSAT. Now with the code below, new users can get good results with any algorithm and dataset.

import java.util.List;
import jsat.classifiers.*;
import jsat.classifiers.svm.PlatSMO;
import jsat.classifiers.svm.SupportVectorLearner.CacheMode;
import jsat.distributions.kernels.RBFKernel;
import jsat.parameters.RandomSearch;

 * @author Edward Raff
public class EasyParameterSearch
    public static void main(String[] args) throws IOException
        ClassificationDataSet dataset = LIBSVMLoader.loadC(new File("diabetes.libsvm"));
        ///////First, the code someone new would use////////
        PlatSMO model = new PlatSMO(new RBFKernel());
        model.setCacheMode(CacheMode.FULL);//Small dataset, so we can do this
        ClassificationModelEvaluation cme = new ClassificationModelEvaluation(model, dataset);
        System.out.println("Error rate: " + cme.getErrorRate());
         * Now some easy code to tune the model. Because the parameter values
         * can be impacted by the dataset, we should split the data in to a train 
         * and test set to avoid overfitting. 

        List<ClassificationDataSet> splits = dataset.randomSplit(0.75, 0.25);
        ClassificationDataSet train = splits.get(0);
        ClassificationDataSet test = splits.get(1);
        RandomSearch search = new RandomSearch((Classifier)model, 3);
        if(search.autoAddParameters(train) > 0)//this method adds parameters, and returns the number of parameters added
            //that way we only do the search if there are any parameters to actually tune
            PlatSMO tunedModel = (PlatSMO) search.getTrainedClassifier();

            cme = new ClassificationModelEvaluation(tunedModel, train);
            System.out.println("Tuned Error rate: " + cme.getErrorRate());
        else//otherwise we will just have to trust our original CV error rate
            System.out.println("This model doesn't seem to have any easy to tune parameters");

The code above uses the new RandomSearch class, rather than GridSearch - but the code would look the same either way. RandomSearch is a good alternative compared to GridSearch, especially when more than 2 parameters are going to be searched over. Using the diabetes dataset, we end up with an output that looks like

Error rate: 0.3489583
Tuned Error rate: 0.265625

Yay, an improvement in the error rate! A potentially better error rate could be found by increasing the number of trials that RandomSearch performs. While the code that makes this possible is a little hairy, so far I'm happy with the way it works for the user. 

Monday, June 1, 2015

Exact Models and Cross Validation from Warm Starts

One of the most common tasks when performing model building is to do some Cross Validation. Computationally speaking, this can be a very performance intensive deal. If you want to do 10-fold cross validation, you're basically going to be waiting ~10x as long for your model to build.

However, there is a potential speed up available for Cross Validation but only when using models that converge to an exact solution. First you train 1 model on all the data, and then train the 10 CV models using the first model as a warm start. This way we are training 11 models, but the 10 CV models will train significantly faster - and should be enough of a difference to offset the training time.

The use of models that converge to a singular exact solution is critical for this to be a sensible idea. The point of the CV is to keep data separate, and evaluate on data you haven't seen/trained on. By warm starting from a model trained on everything, the CV models have kinda seen all the data before. But because the model converges to an exact solution, it would be the same solution regardless of the warm start. The model basically forgets about the earlier data.

If we had used a model that does not converge to an exact solution, or has multiple local minima - this approach would bias the CV error. In the former case, we may not move all the way from the first solution we found. Being closer to the initial solution means we may have enough bias left to "remember" some of what we saw. This is even worse in the later case - the initial solution on all the data may put you in a local minima that you wouldn't have been in if you hadn't warm started, which basically guarantees you won't be able to get too far away.

Hopefully I'll get some time to experiment with actually implementing this idea.