Sunday, March 11, 2018

Reflecting on two years of Academic Publishing & Reviewer #2

On March 2nd I defended my PhD thesis at UMBC! I can now call myself Dr. Raff and be not quite the Jewish doctor my mom had always hoped for!

It's been a whirlwind of a journey, which started a bit over two years ago. I've been very fortunate to have a lot of support from my work, client, friends, and family that allowed me to do this. It makes me hesitant to talk about advice and overall thoughts on the process, given that mine has been far from normal and quite blessed in terms of resources.

But I do want to write down some of my thoughts on publishing, at least from my limited experience. Before 2016, I had never written a real academic paper - let alone published one in a peer-reviewed venue. I've now got 9 peer-reviewed papers, 2 papers under revision, 3 papers under review, and two new papers being written. For a two-year window I feel that has gone quite well, and overall I've enjoyed being able to share what I've done - but I can't say I've enjoyed the sharing process itself.

Writing a paper is something I had thought about and knew would be a requirement if I ever did a PhD, but it was something I feared. Written communication has always been a weakness of mine, due in part to dyslexia, and I had no concept of what it was like to write a real paper and get it published. There were and still are tons of resources online about how to write the paper itself, but none that I've found talk about how it feels.

The writing of papers itself has so far been surprisingly fun. You've got your intuition about the problem in your head and some ugly code that solves it. Now you are modeling your ideas into a more formal notion - something that can be shared with a wider audience. I've found the writing process itself helps me devise more experiments to run. Imagining I was reading this paper and someone else had written it, what would I like to know more about? What would have convinced me? Then go do that and add it in!

But that's all the writing; then comes the dread.

Submitting your paper for review. You rush to the deadline and get it in, and now you wait. You wait for 2-4 months to hear back from 2-3 of your peers about your paper. When first starting, I assumed reviewers would look for the same things I looked for in papers:

  • Did I learn something new from this? (Or if you are math-dumb like me, did I understand it at all?)
  • Are the contributions of the paper helpful to some set of problems / needs, and how big is that set?
  • Do I gain any new capabilities, or improve existing capabilities?
  • Do the contributions make my life easier, even if I don't gain any new or improved abilities?
At a high level these seemed reasonable, and I was told that reviewers should be looking for these things. But then I got my first set of reviews back, and second, and third... and I've been surprised at how needlessly negative reviewers can be. I should caution that most of my papers have been in the space of ML + Security (specifically malware), so I've got a biased sample pool. But I've found that I almost always get a negative review, each with differing complaints. My impression has been that reviewers are often instead looking for (from the reviewer's apparent perspective):
  • Do I think the solution is obvious?
  • Did the paper answer a question I had? Its corollary: does the paper avoid contradicting anything I believe?
  • Does the contribution solve a problem I care about?
  • Do the contributions result in improvements to standard benchmarks?
I've emphasized the "I"s in these, as I believe this is where a lot of the problem comes from. Even when my papers are accepted, there is usually a Reviewer #2 who is dissenting and dismissive, with a cavalier tone and disparagements that can suck the life out of you. The reviewer sees only themselves, and perhaps how smart their review makes them seem - without care that there is a person behind the paper. A person who has spent months, possibly years, on this paper - the results, the experiments, the writing itself - and has been waiting months for judgment.

One of my favorites is complaints about novelty, as brought up in a recent bit of tweeting (which inspired me to finish this blog post). One of my favorite papers so far, and the one at the most prestigious venue I've gotten a paper into, was the one introducing LZJD at KDD. It was inspired by a 2004 paper that introduced the Normalized Compression Distance (NCD), a method of measuring similarity between arbitrary things via compression algorithms. While my paper was accepted, I did have one reviewer who disparaged it as "lacking novelty" - complaining that anyone could have come up with it. Yet no one did come up with it. Despite the many follow-up papers to NCD over at least a decade, no one had created the same idea. This review was simply a list of complaints, with almost platitudinous strengths - as if patting me on the head for trying hard.

While they claimed it was obvious that LZJD could be made, they clearly hadn't made it. And apparently neither had anyone else. Maybe the reviewer truly is just better versed in that area, and would have quickly devised my same solution as their first pass. But that doesn't mean it was obvious to everyone, and it doesn't mean it isn't novel.

Sitting on the note of novelty - I have also found frustration at its amorphous definition in the eyes of reviewers. There is an apparent belief among many that novelty is synonymous with complexity; that simple solutions can't be novel by definition. This is something that has increasingly caused me anxiety as I write papers, as I personally strive for simpler solutions. Simpler solutions are easier to implement, to maintain, to debug, to share, to replicate, and to understand. I like simple solutions. But current publishing seems to penalize this, and I've found myself on occasion being drawn to the idea of stating things in a more formal or mathematical way - not for any analytical or formal benefit, but because reviewers explicitly ask me to. So if I do it from the start, perhaps they will be happy from the start?

Another fun form of dismissiveness is the not-my-problem problem: the work isn't interesting because the reviewer doesn't personally have this problem. It's something I think big-company research labs are making worse.

For our MalConv paper, one of the early reviews asked us for extensive results testing many different architecture sizes and variants. While this review as a whole was not mean, the ask itself was untenable for our group. Training the model took a month on a DGX-1, a fact included in the paper, and we don't have any more of them! I've got a good amount of computational resources, but the reviewer's ask would have taken all of my resources years to produce. A researcher at Google may have such luxuries, but I do not.

A better example of this problem (and the 3rd item in my bullet list), one which includes the callousness I have come to fear, came from my swell SHWeL paper, which extended LZJD. In particular, I had come up with the idea to help work around class imbalance problems. This is a big problem for our research in malware detection, as we don't have that much data. It's expensive to get, and companies haven't generally been willing to share. Reviewer #1 was glowing, talking of the paper's high quality. Reviewer #2 voted reject, stating that the class imbalance wasn't important and that, since I didn't deal with the adversarial problem, the paper wasn't worth considering - literally calling my paper good "academics" and stating it was a shame that the LZJD paper was published at all.

I almost cried the night I got that review. I thought it was a really good paper, and I had agonized over where to send it and how to best polish it. Afterward, I felt like an imbecile. Why was I doing a PhD? I wasn't worthy - I couldn't even convince people my work was decent. It felt like there was lead in my chest that was going to drag me down to hell, where I could burn with all the other bad researchers who never actually contributed anything.

In none of this am I including the frustration of dealing with reviewers who just don't seem to care or to have read the paper. It is agonizingly painful to know your paper was shot down over falsehoods: from reviewers claiming linear SVMs are equivalent to Lasso-regularized Logistic Regression, to reviewers saying I should add a figure to show X when such a figure is labeled "Figure 1" and contains, word for word, what they asked for. I've even had a reviewer simply claim my paper had insufficient experiments, when 5 pages were dedicated to every experiment I could find from prior works. It's a situation you just can't win.

This blog post is getting on the long side, so I'll wrap it up now. But I want to implore everyone who ever reviews: be polite and be constructive. I've had good negative reviews - reviews that discussed the strengths of my paper and suggested how I could further improve them, not simply listing off the sins I may have committed. All of us are trying to contribute something we felt was valuable enough to put down in black and white.

To those who haven't published before: good luck, and I hope you have a good mentor! I want this post to prepare you for the worst of it, allowing you to better enjoy the good parts! I've felt the process has been a net positive in my life so far. With my defense done I have no career need to keep publishing, so I hope my continued effort to do so serves as an indicator that it is worth it overall.


Monday, May 29, 2017

New in JSAT 0.0.8!

Wow, things have been busy. Health is improving and lots of other fun, exciting things are going on, including a new release of JSAT! I just pushed 0.0.8 to Maven; here are some of the new changes that I'm excited about:

The Convex Polytope Machine is a new classifier I've added, which is very similar in spirit to the AMM classifier. The CPM boasts a resilience to overfitting, but is binary only. I wasn't quite able to get results as good as the CPM paper, but my AMM implementation also performs significantly better than theirs - so I'm not going to read too much into it. The CPM should be a good, relatively fast, and fairly accurate classifier for general use on any problem that will work well with a One-vs-All extension to multi-class problems.
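
To give an idea of what using it looks like, here's a minimal sketch. The exact class name and package for the CPM are assumptions on my part (check the 0.0.8 release notes/Javadoc), and the file path is just a placeholder - but the evaluation code is the same pattern used elsewhere on this blog, and OneVSAll handles the binary-to-multi-class reduction.

import java.io.File;
import java.io.IOException;
import jsat.classifiers.*;
import jsat.classifiers.svm.extended.CPM; //assumed package and name - verify against the Javadoc
import jsat.io.LIBSVMLoader;

public class CPMExample
{
    public static void main(String[] args) throws IOException
    {
        //"someData.libsvm" is just a placeholder path
        ClassificationDataSet data = LIBSVMLoader.loadC(new File("someData.libsvm"));

        //NOTE: "CPM" is my assumed class name for the Convex Polytope Machine
        Classifier cpm = new CPM();
        //the CPM is binary-only, so wrap it for multi-class problems
        Classifier model = new OneVSAll(cpm);

        ClassificationModelEvaluation cme = new ClassificationModelEvaluation(model, data);
        cme.evaluateCrossValidation(10);
        System.out.println("Error rate: " + cme.getErrorRate());
    }
}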

I've also added some improved bounds to Hamerly's K-Means algorithm. I only did what's equivalent to "HamerlyB" in the paper, as the 3rd improvement seemed like it was going to be a little hairy from a code perspective. I might go back and revisit it. Interestingly, these improvements had no impact on the standard MNIST dataset, which I was using to test the code, so I was very confused at first. But on other datasets I was able to see an improvement in both runtime and distance computations.
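
If you want to compare on your own data, a rough sketch is below. I'm assuming the class lives at jsat.clustering.kmeans.HamerlyKMeans with a no-argument constructor and the usual cluster(dataSet, k) call - the Javadoc lists the actual constructor options (distance metric, seed selection, and so on).

import java.io.File;
import java.io.IOException;
import java.util.List;
import jsat.DataSet;
import jsat.classifiers.DataPoint;
import jsat.clustering.kmeans.HamerlyKMeans; //assumed location - verify
import jsat.io.LIBSVMLoader;

public class HamerlyKMeansExample
{
    public static void main(String[] args) throws IOException
    {
        //labels are ignored for clustering, we just want the feature vectors
        DataSet data = LIBSVMLoader.loadC(new File("mnist.libsvm"));

        HamerlyKMeans kMeans = new HamerlyKMeans(); //assumed default constructor
        //ask for k = 10 clusters (one per digit in MNIST)
        List<List<DataPoint>> clusters = kMeans.cluster(data, 10);
        for (int i = 0; i < clusters.size(); i++)
            System.out.println("Cluster " + i + ": " + clusters.get(i).size() + " points");
    }
}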

I've generally improved a lot of the vector collections in JSAT and added a new one - definitely check the release notes for the list! A big one was going back and re-implementing the KDTree algorithm. I realized people were getting a bad impression of JSAT by comparing the KDTree implementation (written early on when I was dumb) to existing ones. It just wasn't a good implementation. On MNIST the new code is a bit over 300 times faster, and much more respectable.

Some highlights from previous releases that I haven't written about yet: 

From 0.0.7, I want to make note of the new DC-SVM algorithm for training a kernel SVM! This new code is fast, and scales to much bigger problems than the previous code ever could. Expect a 4x-30x+ speedup if you are switching from the Platt SMO version in JSAT. I've used this new code to train an SVM on a dataset with 400,000 training points in a couple of hours, where the old SMO version would have taken weeks. My code does make a few departures from the paper, but there were details missing and some numbers that don't add up to me. The biggest difference is the SVM solver, as the standard Platt SMO won't work due to its bias term. Either way, I've gotten some great results. 
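
If you're switching from the Platt SMO code, the change should be mostly a drop-in swap. Here's a minimal sketch - I'm assuming the solver is exposed as a classifier named DCSVM taking a kernel in its constructor, so check the 0.0.7 Javadoc for the real name and its early-stopping options; the file path is just for illustration.

import java.io.File;
import java.io.IOException;
import jsat.classifiers.CategoricalResults;
import jsat.classifiers.ClassificationDataSet;
import jsat.classifiers.svm.DCSVM; //assumed name/location - verify against the Javadoc
import jsat.distributions.kernels.RBFKernel;
import jsat.io.LIBSVMLoader;

public class DCSVMExample
{
    public static void main(String[] args) throws IOException
    {
        ClassificationDataSet data = LIBSVMLoader.loadC(new File("bigTrainingSet.libsvm"));

        //assumed constructor taking the kernel; early stopping is described
        //as the default, so no extra configuration should be needed
        DCSVM svm = new DCSVM(new RBFKernel());
        svm.trainC(data);

        CategoricalResults result = svm.classify(data.getDataPoint(0));
        System.out.println("Predicted class: " + result.mostLikely());
    }
}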

From 0.0.6, I implemented the HDBSCAN algorithm. HDBSCAN has been getting more popular over time, and deservedly so. It simplifies a lot of the problems with DBSCAN, provides a fairly intuitive/interpretable parameter to tune (that doesn't need a whole lot of tuning), and delivers some great clusters. My implementation is O(n²) due to the calculation of the Minimum Spanning Tree of the dataset. This could be reduced to a best case of O(n log n) using some nifty algorithms, but that code would integrate more easily once I've made the Java 8 jump. I might implement an approximate MST algorithm in the interim so that the code can be used on larger datasets.
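
A rough usage sketch is below. I'm assuming HDBSCAN follows JSAT's usual Clusterer interface and has a constructor taking the minimum-points parameter mentioned above - verify both against the Javadoc; the file name is a placeholder.

import java.io.File;
import java.io.IOException;
import java.util.List;
import jsat.DataSet;
import jsat.classifiers.DataPoint;
import jsat.clustering.HDBSCAN; //assumed location - verify
import jsat.io.LIBSVMLoader;

public class HDBSCANExample
{
    public static void main(String[] args) throws IOException
    {
        DataSet data = LIBSVMLoader.loadC(new File("someData.libsvm"));

        //the minimum-points value is the one intuitive knob to tune
        HDBSCAN hdbscan = new HDBSCAN(15); //assumed constructor taking min points
        //unlike k-means, we don't have to specify the number of clusters
        List<List<DataPoint>> clusters = hdbscan.cluster(data);
        System.out.println("Found " + clusters.size() + " clusters");
    }
}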

0.0.6 also included the LargeViz algorithm. Think of it as t-SNE, but with consistently good results rather than sometimes needing a bit of fiddling. Try it out!


That's all for now! I'm hoping to make a post about some of my more recent research that's about to be published (done using JSAT!), and I will see how interested people are in hearing about that. Now that things aren't quite as crazy I can hopefully start blogging a bit more.

Wednesday, January 4, 2017

A New Year and next JSAT version

Happy new year everyone! I'm recovering from surgery nicely, and hoping to start the new year with a more upward trend on matters related to my health. This has really slowed down a lot of my plans for JSAT, but I hope it will still be useful to everyone. I'm going to prepare a release of the current changes, which may be of interest to some. I added two new SVM classes. One is called SVMnoBias, which works with most of the kernels people are interested in (the kernel has to be normalized, which the RBF kernel and similar friends are). This new SVM class is faster than the PlattSMO one, and is used to implement the Divide and Conquer SVM solver, which is supposed to be very fast despite being an approximation when using early stopping (which is the default). I haven't tested this on any super large datasets yet, but initial results looked promising. The Kernel K-Means also got improvements in runtime.

Some of the stuff I'm planning to work on in the near future includes a follow-up to the AMM, called the Convex Polytope Machine. I'm also hoping to round out some more of the kernel methods, as a few recent papers have proposed some interesting-looking approximate feature space algorithms for the RBF kernel that would be a good improvement over the Random Fourier Features currently implemented. I've also received a few requests about Neural Networks in JSAT. While I think there are a lot of much better Neural Network tools out there (especially when it comes to using GPUs), I'm going to go back and add Batch Normalization as a layer option. I've thought about adding convolutions as well, but I think there are too many great tools for that use case. Anything I implement is going to be considerably slower, especially without a GPU, so I'd rather people use other, better tools instead.

Longer term I'm thinking of putting some of the improvements discussed in the XGBoost paper into JSAT.

Sunday, June 5, 2016

New in JSAT 0.0.4 and 0.0.5

Ah, I've gotten bad about making posts. I've recently started a PhD program and that has eaten up a lot of my time. I'm still using JSAT for stuff, and a lot of what I'll be adding to JSAT will be stuff I plan on using for my research.

With that, I also wanted to give an update on what's new in the last two releases of JSAT. In 0.0.4 I added some support for missing values in data. The ARFF file format supports this, and uses "?" to indicate missing values. The loader used to throw these away, but now they are included in the data. For numeric features, JSAT will indicate a missing value with NaN. For categorical values, a negative value will be used. I added this due to a charity project we were working on at my job that had a lot of missing values. Missing data is a common problem we like to ignore, but decision trees can learn (and predict) from data with missing values in place. So I added that support. For all other algorithms, you'll probably want to use the Imputer data transform.
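
As a quick sketch of what working with that looks like: the Imputer name comes from above, but its constructor and the applyTransform call here are assumptions on my part (check the Javadoc), and the file name is a placeholder.

import java.io.File;
import jsat.SimpleDataSet;
import jsat.classifiers.DataPoint;
import jsat.datatransform.Imputer; //assumed name/location - verify in the 0.0.4 Javadoc
import jsat.io.ARFFLoader;
import jsat.linear.Vec;

public class MissingValuesExample
{
    public static void main(String[] args)
    {
        //"?" entries in the ARFF file are kept: numeric ones become NaN,
        //categorical ones become a negative category index
        SimpleDataSet data = ARFFLoader.loadArffFile(new File("charity.arff"));

        //spotting missing numeric values on the first data point
        DataPoint dp = data.getDataPoint(0);
        Vec numeric = dp.getNumericalValues();
        for (int j = 0; j < numeric.length(); j++)
            if (Double.isNaN(numeric.get(j)))
                System.out.println("Feature " + j + " is missing for the first point");

        //Trees can train on this as-is. For everything else, impute first.
        //The constructor/usage here is an assumption - check the Javadoc.
        data.applyTransform(new Imputer(data));
    }
}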

I also finally got around to improving the parallelism when training decision stumps/trees. It worked well for MNIST on my computer, but I haven't super thoroughly performance tested it. Let me know how it works on your problems with varied datasets! I also improved the parallel performance of the kernelized k-means objects.

An actual new addition is a much better hierarchical clustering algorithm based on nearest neighbor chains (called NNChainHAC in JSAT). This allows you to do many algorithms in O(n²) time (which is optimal) and O(n) memory. I'm not actually 100% sure I got the O(n) memory part working perfectly, as it takes more compute time than I can handle to keep testing larger and larger problems. Part of the problem is that, after the initial distance computations, most of the work is in data access and merging to create the dendrogram. So to minimize memory use, you should be using a sparse hash map from the remaining indices to their merged scores. But the amount of work is so minimal that computing the hash is a huge portion of the work, and using a dense array ends up being at least 2x faster.

Finally, one of the items I've meant to add for a while is feature importance from decision trees. There are 3 algorithms implemented, and they can be integrated easily with Random Forest. The out-of-bag magic of RFs is used with the feature importance to get a ranking of features that is really useful and takes into account non-linear relationships. A pretty rare and useful property!

I'm currently working on some more cluster-related stuff for a research project. Another item I attempted to work on recently was the recent LargeViz paper. Unfortunately the gradient I derived does not seem to work, which makes me wish I was better / more confident at math :(. I did implement an initial version of the Random Projection trees with neighbor exploration, though it needs more testing. I'm also not sure what parameters they used to get such high accuracies in Figure 2. I tested on MNIST and got a significant performance drop using the trees - and it was only 2 times faster (for some parameters) than my minimum variance VP tree.

I still want to play with it more though, and I think I could apply the same idea to VP trees to get a 2x speedup. The problem with the RP tree is that its splitting boundary is the line orthogonal to two random points. So you can either figure out which side you are on by doing 2 distance computations at each node, or by storing a dense vector representing the line explicitly. The former takes twice as much compute time, and the latter takes up a huge amount of memory. Making a VP "forest" could get around that, since you only need 1 random point and a radius.

Friday, November 20, 2015

A Binary Format for JSAT Datasets

I made a post a while ago about improving the LIBSVM file loader in JSAT so that it wouldn't use nearly as much memory and be a good deal faster too, and I complained about everyone using human readable ASCII file formats. Spurred by a recent pull request, I've finally gone ahead and implemented a simple binary format for storing datasets in JSAT. I'm not using Java's serialization for this, so it's a binary format that one could just as easily implement a reader/writer for in other languages as well.

The binary format supports both sparse and dense storage of numeric features, and stores the string names for categorical features. Since floating point values take up the majority of space, it also supports saving them in multiple different methods. Currently it can save values as a 32 or 64 bit float, as a short, or as a signed/unsigned byte. The default method is to scan through the dataset and check which of the options would result in the smallest file without losing any information. Despite this overhead, it's faster than writing either an ARFF or LIBSVM file! You can also explicitly pass the method you want to store it as, which will skip the overhead and do a lossy conversion if necessary.
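
A sketch of the intended round trip is below. The static method names on JSATData are from memory, so treat them as assumptions and verify them against the Javadoc; the idea is simply to write with the default AUTO detection and then read the file back.

import java.io.*;
import jsat.classifiers.ClassificationDataSet;
import jsat.io.JSATData;
import jsat.io.LIBSVMLoader;

public class JSATDataExample
{
    public static void main(String[] args) throws IOException
    {
        ClassificationDataSet data = LIBSVMLoader.loadC(new File("mnist.libsvm"));

        //write it out; with no storage method given, AUTO scans for the
        //smallest lossless option (method name assumed - verify)
        try (OutputStream out = new BufferedOutputStream(new FileOutputStream("mnist.jsat")))
        {
            JSATData.writeData(data, out);
        }

        //read it back (again, assumed method name - verify)
        try (InputStream in = new BufferedInputStream(new FileInputStream("mnist.jsat")))
        {
            ClassificationDataSet loaded = JSATData.loadClassification(in);
            System.out.println("Loaded " + loaded.getSampleSize() + " data points");
        }
    }
}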

I did a simple performance test for reading/writing the training set of MNIST. For JSATData I tested with sparse and dense numeric features (determined by how the data is stored in memory), using the Auto, Unsigned Byte, and 64-bit float options.

For this first table, I left the data in its original form, so it was stored as integers from 0 to 255, making it easy to save as bytes. The JSAT writers for ARFF and LIBSVM write out 0s as "0.0", so technically they have some unnecessary padding. These numbers are from my Macbook Air, which has a nice SSD.


Format                     Read Time (ms)   Write Time (ms)   File Size (MB)
ARFF                       7810             7091              203.7
LIBSVM                     3777             3322              87.5
JSATData FP64 (sparse)     989              790               108.9
JSATData U_BYTE (sparse)   758              586               45.6
JSATData AUTO (sparse)     758              1652              45.6
JSATData FP64 (dense)      2839             1894              377.1
JSATData U_BYTE (dense)    1735             895               47.4
JSATData AUTO (dense)      1580             1735              47.4

(AUTO selects U_BYTE for this data, so the file it writes - and therefore its read time and size - are the same as U_BYTE's; only the write time pays for the extra scan.)

In this next table, I normalized the values to a range of [0, 1]. This makes the JSAT AUTO code select FP64, and uses a lot more text in ARFF and LIBSVM. Since U_BYTE won't work anymore, I also forced FP32 as an option.
Format                     Read Time (ms)   Write Time (ms)   File Size (MB)
ARFF                       15247            9445              318.3
LIBSVM                     6920             5643              202.1
JSATData FP64 (sparse)     1032             732               108.9
JSATData AUTO (sparse)     1032             808               108.9
JSATData FP32 (sparse)     941              833               72.7
JSATData FP64 (dense)      2873             1794              377.1
JSATData AUTO (dense)      2873             1788              377.1
JSATData FP32 (dense)      2765             1933              188.7

(Here AUTO selects FP64, so its read time and file size match FP64's.)

You'll notice that the difference in write time between the 64-bit and AUTO options is much smaller in the second table. This is because the code for auto-detecting the best format will quit early once it has eliminated everything more efficient than a 64-bit float. As promised, the 64-bit format doesn't change in file size at all, which is a much more consistent and desirable behavior. And even when JSATData does not result in smaller file sizes, the format is simple, making it much more IO-bound, so it's much faster than the CPU-bound ARFF and LIBSVM, which have to do a bunch of string processing and math to convert the strings to floats.

I've also added a small feature for strings stored in the format, since I save out the names of categorical features and their options. There is a simple marker to indicate whether strings are ASCII or UTF-16; that way, for common ASCII strings, not as much data is wasted. The writer will also auto-detect if ASCII is safe or if it needs UTF-16.
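
The check itself is conceptually tiny - something like the stand-alone snippet below (this is illustrative, not the actual code in JSAT):

import java.nio.charset.StandardCharsets;

public class AsciiCheck
{
    /**
     * Returns true if every character in the string fits in 7-bit ASCII,
     * so it can be stored one byte per character instead of two.
     */
    public static boolean isAsciiSafe(String s)
    {
        for (int i = 0; i < s.length(); i++)
            if (s.charAt(i) > 127)
                return false;
        return true;
    }

    public static void main(String[] args)
    {
        String featureName = "petal-width"; //a made-up categorical feature name
        byte[] stored = featureName.getBytes(
                isAsciiSafe(featureName) ? StandardCharsets.US_ASCII
                                         : StandardCharsets.UTF_16);
        System.out.println(featureName + " stored in " + stored.length + " bytes");
    }
}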

I've written this format with JSAT's three main dataset types in mind, but hopefully this can be useful for others as well. If there is interest I may write a reader/writer for Python and C/C++ and host them up as small little projects on github. 

Monday, October 26, 2015

Visualization Algorithms in JSAT

For the next update of JSAT (you can try them early in the 0.0.3-SNAPSHOT release), I've been working on implementing a few algorithms meant specifically for visualizing datasets. For now I've chosen three of the more common / well-known algorithms: MDS, Isomap, and t-SNE. Unlike PCA, these algorithms are mostly meant to transform only one dataset at once - the intention being to visualize that block of data. You can't necessarily apply these algorithms to new data points, nor would you necessarily want to. Their purpose is really to help explore data and learn more.
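
To make that usage pattern concrete, here is a rough sketch with t-SNE. I'm assuming the class sits at jsat.datatransform.visualization.TSNE and that the new VisualizationTransform interface exposes a transform(dataSet) method returning the embedded dataset - check the snapshot Javadoc for the exact signatures; the file name is a placeholder.

import java.io.File;
import java.io.IOException;
import jsat.DataSet;
import jsat.classifiers.ClassificationDataSet;
import jsat.datatransform.visualization.TSNE; //assumed location - verify
import jsat.io.LIBSVMLoader;
import jsat.linear.Vec;

public class TSNEExample
{
    public static void main(String[] args) throws IOException
    {
        ClassificationDataSet data = LIBSVMLoader.loadC(new File("mnist.libsvm"));

        TSNE tsne = new TSNE(); //defaults to a 2D embedding (assumed)
        //unlike PCA, this transforms just this one dataset - there is no
        //fitted model to apply to new points later
        DataSet embedded = tsne.transform(data);

        //each data point now has a 2D coordinate you can plot
        Vec xy = embedded.getDataPoint(0).getNumericalValues();
        System.out.println("First point at (" + xy.get(0) + ", " + xy.get(1) + ")");
    }
}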

A classic test problem is the "Swiss roll" dataset. This is a dataset where some of the data, such as the red and blue points below, are close based on the Euclidean distance - but our intuition about the data is that they are really far away from each other. So a good 2D visualization of this data would place those data points far away from each other while keeping the red/orange points close together.

[figure: the Swiss roll dataset]
t-SNE doesn't do a great job of visualizing the Swiss roll data, but for more general datasets t-SNE is a great algorithm. The implementation in JSAT is the faster Barnes-Hut approximation, so it runs in O(n log n) time, making it applicable to larger datasets. Below is a visualization of t-SNE on the MNIST dataset, which makes it easy to see 10 clusters, 1 for each digit. Looking at the relationships, we can see stuff that makes sense. The 9 and 4 clusters are near each other, which makes sense - the noisy 9s and 4s could easily be misconstrued. We also see some noisy 9s in the cluster of 7s, another understandable case.

[figure: t-SNE visualization of MNIST]
t-SNE was a bit of work to implement, and I had difficulty replicating the results using the exact details in the paper. A few changes worked well for me, and had the benefit of being a bit easier to code. Right now my t-SNE implementation is slower single-threaded than the standard implementation, but mine does support multi-threaded execution.

Hopefully these new tools will be useful for people trying to get a better intuition of what's happening in their data. All three algorithms support multi-threaded execution and implement a new VisualizationTransform interface. The Isomap implementation also includes an extension called C-Isomap, and I'm hoping to add more useful visualization algorithms in the future.

On another note, I've moved the GUI components outside of JSAT completely. Anything GUI related will now be in a new project, JSATFX. As you might guess from the name, I'm going to be trying to do more of the code in JavaFX so I can learn it. The visualizations above are from the library.

Saturday, July 11, 2015

Easier Hyperparameter Tuning

I've just released the newest version of JSAT to my maven repo and github, and the biggest change is some work I've done to make parameter tuning much easier for people who are new to Machine Learning or JSAT specifically.

The crux of this new code, from the user's perspective, is a new method on the GridSearch object, autoAddParameters. This method takes a DataSet object and will automatically add hyperparameters with values to be tested. These values are adjusted to the dataset given; that way, reasonable results can be obtained even if the dataset isn't scaled to a reasonable range.

The goal of this is to help new users and people new to ML. The issue of tuning an algorithm has been the most common feedback I receive from users, and part of the issue is that new users simply don't know what values are reasonable to try for the many algorithms in JSAT. Now, with the code below, new users can get good results with any algorithm and dataset.


import java.io.File;
import java.io.IOException;
import java.util.List;
import jsat.classifiers.*;
import jsat.classifiers.svm.PlatSMO;
import jsat.classifiers.svm.SupportVectorLearner.CacheMode;
import jsat.distributions.kernels.RBFKernel;
import jsat.io.LIBSVMLoader;
import jsat.parameters.RandomSearch;

/**
 *
 * @author Edward Raff
 */
public class EasyParameterSearch
{
    public static void main(String[] args) throws IOException
    {
        ClassificationDataSet dataset = LIBSVMLoader.loadC(new File("diabetes.libsvm"));
        
        ///////First, the code someone new would use////////
        PlatSMO model = new PlatSMO(new RBFKernel());
        model.setCacheMode(CacheMode.FULL);//Small dataset, so we can do this
        
        ClassificationModelEvaluation cme = new ClassificationModelEvaluation(model, dataset);
        cme.evaluateCrossValidation(10);
        
        System.out.println("Error rate: " + cme.getErrorRate());
        
        /*
         * Now some easy code to tune the model. Because the parameter values
 * can be impacted by the dataset, we should split the data into a train
 * and test set to avoid overfitting.
         */

        List<ClassificationDataSet> splits = dataset.randomSplit(0.75, 0.25);
        ClassificationDataSet train = splits.get(0);
        ClassificationDataSet test = splits.get(1);
        
        RandomSearch search = new RandomSearch((Classifier)model, 3);
        if(search.autoAddParameters(train) > 0)//this method adds parameters, and returns the number of parameters added
        {
            //that way we only do the search if there are any parameters to actually tune
            search.trainC(train);//tune on the training split only, keeping the test set untouched
            PlatSMO tunedModel = (PlatSMO) search.getTrainedClassifier();

            cme = new ClassificationModelEvaluation(tunedModel, train);
            cme.evaluateTestSet(test);
            System.out.println("Tuned Error rate: " + cme.getErrorRate());
        }
        else//otherwise we will just have to trust our original CV error rate
            System.out.println("This model doesn't seem to have any easy to tune parameters");
    }
}

The code above uses the new RandomSearch class rather than GridSearch - but the code would look the same either way. RandomSearch is a good alternative to GridSearch, especially when more than 2 parameters are going to be searched over. Using the diabetes dataset, we end up with output that looks like:

Error rate: 0.3489583
Tuned Error rate: 0.265625

Yay, an improvement in the error rate! A potentially better error rate could be found by increasing the number of trials that RandomSearch performs. While the code that makes this possible is a little hairy, so far I'm happy with the way it works for the user.