Machine Learning with JSAT: A Binary Format for JSAT Datasets

I made a post a while ago about improving the LIBSVM file loader in JSAT so that it wouldn't use nearly as much memory and be a good deal faster too, and I complained about everyone using human readable ASCII file formats. Spurred by a recent pull request, I've finally gone ahead and implemented a simple binary format for storing datasets in JSAT. I'm not using Java's serialization for this, so it's a binary format that one could just as easily implement a reader/writer for in other languages as well.

The binary format supports both sparse and dense storage of numeric features, and stores the string names for categorical features. Since floating point values take up the majority of space, it also supports saving them in multiple different methods. Currently it can save values as a 32 or 64 bit float, as a short, or as a signed/unsigned byte. The default method is to scan through the dataset and check which of the options would result in the smallest file without losing any information. Despite this overhead, it's faster than writing either an ARFF or LIBSVM file! You can also explicitly pass the method you want to store it as, which will skip the overhead and do a lossy conversion if necessary.

I did a simple performance case for reading/writing the training set of MNIST. For the JSATData I tested with sparse and dense numeric features (determined by how the data is stored in memory) and using the Auto/ Unsigned byte, and 64 bit float options.

For this first table, I left the data normalized, so it was stored as integers from 0 to 255, making it easy for it to be saved as bytes. The JSAT writer for ARFF and LIBSVM writes out 0s as "0.0", so technically they have a some unnecessary padding. These numbers are from my Macbook Air which has a nice SSD.

Method	ARFF	LIBSVM	JSATData FP64 (sparse)	JSATData U_BYTE (sparse)	JSATData AUTO (sparse)	JSATData FP64	JSATData U_BYTE	JSATData AUTO
Read Time (ms)	7810	3777	989	758		2839	1735
Write Time (ms)	7091	3322	790	586	1652	1894	895	1580
File Size (MB)	203.7	87.5	108.9	45.6		377.1	47.4

In this next table, I normalized the values to a range of [0, 1]. This makes the JSAT code AUTO select FP64, and uses a lot more text in ARFF and LIBSVM. Since U_BYTE won't work any more, I also did a force as FP32.

Method	ARFF	LIBSVM	JSATData FP64 (sparse)	JSATData AUTO (sparse)	JSATDATA FP32 (sparse)	JSATData FP64	JSATData AUTO	JSATData FP32
Read Time (ms)	15247	6920	1032		941	2873		2765
Write Time (ms)	9445	5643	732	808	833	1794	1788	1933
File Size (MB)	318.3	202.1	108.9		72.7	377.1		188.7

You'll notice that the differences in write time didn't change so much for the second table between 64 bit and AUTO. This is because the code for auto detecting the best format will quit early once it has eliminated everything more efficient than a 64 bit float. As promised, the 64 bit format doesn't change in file size at all, which is a much more consistent and desirable behavior. And even when JSATData does not result in smaller file sizes, the format is simple making it much more IO bound, so it's much faster than the CPU bound ARFF and LIBSVM which have to do a bunch of string processing and math to convert the strings to floats.

I've also added a small feature for strings stored in the format, since I save out the names of categorical features and their options. There is a simple marker to indicate if strings are ASCII or UTF-16, that way for common ASCII strings not as much data is wasted. The writer will also auto-detect if ASCII is safe or it needs UTF-16.

I've written this format with JSAT's three main dataset types in mind, but hopefully this can be useful for others as well. If there is interest I may write a reader/writer for Python and C/C++ and host them up as small little projects on github.

Machine Learning with JSAT

Friday, November 20, 2015

A Binary Format for JSAT Datasets

No comments:

Post a Comment

About Me