Friday, November 20, 2015

A Binary Format for JSAT Datasets

I made a post a while ago about improving the LIBSVM file loader in JSAT so that it wouldn't use nearly as much memory and be a good deal faster too, and I complained about everyone using human readable ASCII file formats. Spurred by a recent pull request, I've finally gone ahead and implemented a simple binary format for storing datasets in JSAT. I'm not using Java's serialization for this, so it's a binary format that one could just as easily implement a reader/writer for in other languages as well.

The binary format supports both sparse and dense storage of numeric features, and stores the string names for categorical features. Since floating point values take up the majority of the space, it also supports saving them in several different formats. Currently it can save values as a 32 or 64 bit float, as a short, or as a signed/unsigned byte. The default method is to scan through the dataset and check which of the options would result in the smallest file without losing any information. Despite this overhead, it's faster than writing either an ARFF or LIBSVM file! You can also explicitly pass the format you want the values stored as, which will skip the scan and do a lossy conversion if necessary.
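To make the default "scan and pick the smallest lossless option" idea concrete, here is a minimal sketch. It is not JSAT's actual code: the class, the enum constants, and the method names are all made up for illustration, and edge cases like negative zero are ignored. It keeps a set of candidate types, throws out any type that can't hold a value exactly, and stops scanning as soon as only the 64 bit float is left.

```java
import java.util.EnumSet;

public class StorageTypeSketch {
    /** Hypothetical type names (not JSAT's), declared from smallest to largest. */
    enum StorageType {
        U_BYTE, BYTE, SHORT, FP32, FP64;

        /** True if v can be stored in this type and read back unchanged. */
        boolean isLossless(double v) {
            boolean whole = v == Math.rint(v); // false for NaN
            switch (this) {
                case U_BYTE: return whole && v >= 0 && v <= 255;
                case BYTE:   return whole && v >= Byte.MIN_VALUE && v <= Byte.MAX_VALUE;
                case SHORT:  return whole && v >= Short.MIN_VALUE && v <= Short.MAX_VALUE;
                case FP32:   return Double.isNaN(v) || (double) (float) v == v;
                default:     return true; // FP64 holds every double
            }
        }
    }

    /** Returns the smallest type that stores every value without loss. */
    static StorageType smallestLossless(double[] values) {
        EnumSet<StorageType> candidates = EnumSet.allOf(StorageType.class);
        for (double v : values) {
            candidates.removeIf(t -> !t.isLossless(v));
            if (candidates.size() == 1) {
                break; // only FP64 is left, nothing more to learn from scanning
            }
        }
        // EnumSet iterates in declaration order, so the first survivor is the smallest
        return candidates.iterator().next();
    }
}
```

With this sketch, pixel values like 0, 128, 255 would come out as U_BYTE, while a value like 0.1 (which a 32 bit float can't represent exactly) forces FP64 right away.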

I did a simple performance test for reading/writing the training set of MNIST. For JSATData I tested with sparse and dense numeric features (determined by how the data is stored in memory), using the AUTO, unsigned byte, and 64 bit float options.

For this first table, I left the data unnormalized, so it was stored as integers from 0 to 255, making it easy for it to be saved as bytes. The JSAT writers for ARFF and LIBSVM write out 0s as "0.0", so technically they have some unnecessary padding. These numbers are from my MacBook Air, which has a nice SSD.

Method                     Read Time (ms)    Write Time (ms)   File Size (MB)
ARFF                       7810              7091              203.7
LIBSVM                     3777              3322              87.5
JSATData FP64 (sparse)     989               790               108.9
JSATData U_BYTE (sparse)   758               586               45.6
JSATData AUTO (sparse)     same as U_BYTE    1652              same as U_BYTE
JSATData FP64              2839              1894              377.1
JSATData U_BYTE            1735              895               47.4
JSATData AUTO              same as U_BYTE    1580              same as U_BYTE

In this next table, I normalized the values to a range of [0, 1]. This makes the JSAT code AUTO select FP64, and uses a lot more text in ARFF and LIBSVM. Since U_BYTE won't work any more, I also forced FP32.

Method                     Read Time (ms)    Write Time (ms)   File Size (MB)
ARFF                       15247             9445              318.3
LIBSVM                     6920              5643              202.1
JSATData FP64 (sparse)     1032              732               108.9
JSATData AUTO (sparse)     same as FP64      808               same as FP64
JSATData FP32 (sparse)     941               833               72.7
JSATData FP64              2873              1794              377.1
JSATData AUTO              same as FP64      1788              same as FP64
JSATData FP32              2765              1933              188.7

You'll notice that in the second table the write times for 64 bit and AUTO are much closer together than in the first. This is because the code for auto detecting the best format will quit early once it has eliminated everything more efficient than a 64 bit float. As promised, the 64 bit format doesn't change in file size at all, which is a much more consistent and desirable behavior. And even when JSATData does not result in smaller file sizes, the format is simple, making it much more IO bound, so it's much faster than the CPU bound ARFF and LIBSVM, which have to do a bunch of string processing and math to convert the strings to floats.
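A small sketch of why this is the case. This is not the JSATData layout, just raw 64 bit values in a buffer, and all of the names are made up for illustration. The binary version is a straight copy of 8 bytes per value, so its size depends only on how many values there are. The text version has to format and parse every number, and its size depends on how many digits each value needs.

```java
import java.nio.ByteBuffer;

public class BinaryVsTextSketch {
    /** Stores each value as a raw 64 bit float: always 8 bytes per value. */
    static byte[] toBinary(double[] values) {
        ByteBuffer buf = ByteBuffer.allocate(Double.BYTES * values.length);
        buf.asDoubleBuffer().put(values);
        return buf.array();
    }

    /** Reads the values back with no parsing at all. */
    static double[] fromBinary(byte[] data) {
        double[] values = new double[data.length / Double.BYTES];
        ByteBuffer.wrap(data).asDoubleBuffer().get(values);
        return values;
    }

    /** Stores each value as space separated text, like an ASCII format would. */
    static String toText(double[] values) {
        StringBuilder sb = new StringBuilder();
        for (double v : values) {
            sb.append(v).append(' ');
        }
        return sb.toString();
    }

    /** Reads text back, paying for a string to double conversion per value. */
    static double[] fromText(String text) {
        String[] tokens = text.trim().split(" ");
        double[] values = new double[tokens.length];
        for (int i = 0; i < tokens.length; i++) {
            values[i] = Double.parseDouble(tokens[i]);
        }
        return values;
    }
}
```

Feeding this the pixel values 0, 128, 255 and then the same values divided by 255 gives identical binary sizes, while the text version grows because 128/255 needs many more digits than "128.0".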

I've also added a small feature for strings stored in the format, since I save out the names of categorical features and their options. There is a simple marker to indicate whether a string is ASCII or UTF-16, so that not as much data is wasted on common ASCII strings. The writer will also auto-detect whether ASCII is safe or UTF-16 is needed.
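Here is roughly what such a marker could look like. This is a sketch under my own assumptions, not the real on-disk layout: the marker values, the 4 byte length prefix, and the big-endian UTF-16 choice are all invented for the example. The writer checks every character, and only falls back to 2 bytes per character when something outside of ASCII shows up.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class StringMarkerSketch {
    // Hypothetical marker values, not taken from the JSATData format
    static final byte ASCII_MARKER = 0;
    static final byte UTF16_MARKER = 1;

    /** True if every character fits in 7 bit ASCII. */
    static boolean isAscii(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) > 127) {
                return false;
            }
        }
        return true;
    }

    /** Writes: 1 marker byte, a 4 byte payload length, then the encoded string. */
    static byte[] encode(String s) {
        boolean ascii = isAscii(s);
        byte[] payload = s.getBytes(ascii ? StandardCharsets.US_ASCII : StandardCharsets.UTF_16BE);
        ByteBuffer buf = ByteBuffer.allocate(1 + Integer.BYTES + payload.length);
        buf.put(ascii ? ASCII_MARKER : UTF16_MARKER);
        buf.putInt(payload.length);
        buf.put(payload);
        return buf.array();
    }

    /** Reads the marker first so it knows how to decode the payload. */
    static String decode(byte[] data) {
        ByteBuffer buf = ByteBuffer.wrap(data);
        boolean ascii = buf.get() == ASCII_MARKER;
        byte[] payload = new byte[buf.getInt()];
        buf.get(payload);
        return new String(payload, ascii ? StandardCharsets.US_ASCII : StandardCharsets.UTF_16BE);
    }
}
```

A plain ASCII name then costs one byte per character plus the small header, and only names with non-ASCII characters pay the UTF-16 cost.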

I've written this format with JSAT's three main dataset types in mind, but hopefully this can be useful for others as well. If there is interest I may write a reader/writer for Python and C/C++ and host them as small projects on GitHub.