The binary format supports both sparse and dense storage of numeric features, and stores the string names for categorical features. Since floating point values take up the majority of the space, it also supports several ways of saving them. Currently it can save values as a 32 or 64 bit float, as a short, or as a signed/unsigned byte. The default method is to scan through the dataset and check which of the options would result in the smallest file without losing any information. Despite this overhead, it's faster than writing either an ARFF or LIBSVM file! You can also explicitly pass the storage method you want, which skips the scan and does a lossy conversion if necessary.
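To make the AUTO idea concrete, here is a minimal Python sketch of picking the smallest lossless storage type. It is not JSAT's actual code: the type names, the candidate ordering, and the helper functions `fits_fp32` and `smallest_lossless_type` are assumptions made for illustration.

```python
import struct

def fits_fp32(v):
    """True if v survives a round trip through a 32-bit float unchanged."""
    try:
        return struct.unpack("<f", struct.pack("<f", v))[0] == v
    except OverflowError:  # magnitude too large for a 32-bit float
        return False

# (name, bytes per value, lossless check), smallest first.
# Names and ordering are illustrative, not JSAT's identifiers.
CANDIDATES = [
    ("U_BYTE", 1, lambda v: v.is_integer() and 0 <= v <= 255),
    ("BYTE", 1, lambda v: v.is_integer() and -128 <= v <= 127),
    ("SHORT", 2, lambda v: v.is_integer() and -32768 <= v <= 32767),
    ("FP32", 4, fits_fp32),
    ("FP64", 8, lambda v: True),
]

def smallest_lossless_type(values):
    """Return the name of the smallest type that stores every value exactly."""
    alive = list(CANDIDATES)
    for v in values:
        v = float(v)
        # Drop every candidate that cannot represent this value exactly.
        alive = [c for c in alive if c[2](v)]
        if len(alive) == 1:
            # Only FP64 is left, so nothing smaller can win: stop scanning.
            break
    return min(alive, key=lambda c: c[1])[0]
```

Under these assumptions, raw pixel values such as `[0, 17, 255]` come out as `U_BYTE`, while a value like `1 / 255` cannot be stored exactly in anything smaller than a 64 bit float, so the scan settles on `FP64` as soon as it sees one.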
I ran a simple performance test of reading/writing the training set of MNIST. For JSATData I tested with sparse and dense numeric features (determined by how the data is stored in memory), using the AUTO, unsigned byte, and 64 bit float options.
For this first table, I left the data unnormalized, so the values were integers from 0 to 255, making it easy to save them as bytes. The JSAT writers for ARFF and LIBSVM write out 0s as "0.0", so technically they have some unnecessary padding. These numbers are from my MacBook Air, which has a nice SSD.
Method | ARFF | LIBSVM | JSATData FP64 (sparse) | JSATData U_BYTE (sparse) | JSATData AUTO (sparse) | JSATData FP64 | JSATData U_BYTE | JSATData AUTO |
---|---|---|---|---|---|---|---|---|
Read Time (ms) | 7810 | 3777 | 989 | 758 | 758 | 2839 | 1735 | 1735 |
Write Time (ms) | 7091 | 3322 | 790 | 586 | 1652 | 1894 | 895 | 1580 |
File Size (MB) | 203.7 | 87.5 | 108.9 | 45.6 | 45.6 | 377.1 | 47.4 | 47.4 |
In this next table, I normalized the values to a range of [0, 1]. This makes JSAT's AUTO option select FP64, and uses a lot more text in ARFF and LIBSVM. Since U_BYTE won't work any more, I also tried forcing FP32.
Method | ARFF | LIBSVM | JSATData FP64 (sparse) | JSATData AUTO (sparse) | JSATData FP32 (sparse) | JSATData FP64 | JSATData AUTO | JSATData FP32 |
---|---|---|---|---|---|---|---|---|
Read Time (ms) | 15247 | 6920 | 1032 | 1032 | 941 | 2873 | 2873 | 2765 |
Write Time (ms) | 9445 | 5643 | 732 | 808 | 833 | 1794 | 1788 | 1933 |
File Size (MB) | 318.3 | 202.1 | 108.9 | 108.9 | 72.7 | 377.1 | 377.1 | 188.7 |
You'll notice that in the second table the write times for 64 bit and AUTO barely differ. This is because the code for auto-detecting the best format quits early once it has eliminated everything more efficient than a 64 bit float. As promised, the 64 bit format doesn't change in file size at all, which is a much more consistent and desirable behavior. And even when JSATData does not result in smaller file sizes, the format is simple, which makes it much more IO bound. That makes it much faster than the CPU bound ARFF and LIBSVM formats, which have to do a bunch of string processing and math to convert the strings to floats.
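The difference between the two decoding styles can be sketched in a few lines of Python. This is only an illustration of fixed-width binary versus text parsing, not the JSATData layout itself: the big-endian byte order and the space-separated text form are assumptions.

```python
import struct

values = [0.0, 0.5, 1.0 / 255.0, 254.0 / 255.0]

# Binary: every value is exactly 8 bytes, so decoding is one fixed-width
# unpack with no parsing. Big-endian is an assumption for this sketch.
binary = struct.pack(f">{len(values)}d", *values)
decoded_binary = list(struct.unpack(f">{len(values)}d", binary))

# Text: every value is a variable-length decimal string that has to be
# tokenized and converted back to a float, digit by digit.
text = " ".join(repr(v) for v in values)
decoded_text = [float(token) for token in text.split()]
```

Both paths recover the same values, but the binary buffer always costs 8 bytes per value regardless of how many digits the number has, which is why the FP64 file size stays the same between the two tables.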
I've also added a small feature for strings stored in the format, since I save out the names of categorical features and their options. A simple marker indicates whether a string is ASCII or UTF-16, so that common ASCII strings don't waste as much space. The writer also auto-detects whether ASCII is safe or UTF-16 is needed.
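A rough Python sketch of that string handling is below. The marker values, the 4-byte length prefix, the big-endian byte order, and the function names `encode_string` and `decode_string` are all hypothetical; the real format may lay things out differently.

```python
import struct

# Hypothetical marker values, chosen only for this sketch.
ASCII_MARKER = 0
UTF16_MARKER = 1

def encode_string(s):
    """Write a marker byte, a 4-byte payload length, then the payload."""
    if s.isascii():
        # One byte per character is enough.
        marker, payload = ASCII_MARKER, s.encode("ascii")
    else:
        # Fall back to UTF-16 when any character is outside ASCII.
        marker, payload = UTF16_MARKER, s.encode("utf-16-be")
    return struct.pack(">BI", marker, len(payload)) + payload

def decode_string(buf):
    """Inverse of encode_string for a buffer that starts with one string."""
    marker, length = struct.unpack_from(">BI", buf, 0)
    payload = buf[5:5 + length]  # header is 1 marker byte + 4 length bytes
    encoding = "ascii" if marker == ASCII_MARKER else "utf-16-be"
    return payload.decode(encoding)
```

With this layout an ASCII name costs one byte per character, while a name containing even one non-ASCII character costs two bytes per character for the whole string.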
I've written this format with JSAT's three main dataset types in mind, but hopefully this can be useful for others as well. If there is interest, I may write a reader/writer for Python and C/C++ and host them as small projects on GitHub.