Monday, March 16, 2015

Improving dataset loading in JSAT/Java

I've had a loader for the LIBSVM format for storing data sets for a while now. I recently ran into a rather fun issue: loading a rather large data set produced an error. I didn't read the error; the first thing I did was run it again with the debugger on, waiting for it to break wherever the error was. And it ran through perfectly!

Turns out I was getting a "GC overhead limit exceeded" error. The JVM throws this when it spends too much time doing GC work, and it just kills your job. By running in debug mode, I slowed down how quickly new garbage was created, which gave the GC enough extra head room to make it through.

While amusing, I did need to fix this - and I encountered something surprising. There doesn't seem to be any good way to turn a subset of a character sequence into a floating point value. The only method I could find was to run Double.parseDouble() on a substring, but that meant creating a new garbage object! So I had to write that myself, and added code to read the file into a buffer and walk through the buffer for the values I needed, re-filling when necessary.
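To make the idea concrete, here is a minimal sketch of that kind of allocation-free parser - not JSAT's actual implementation. It reads a double straight out of a range of a CharSequence without creating a substring, and only handles a sign, integer part, and fractional part (no exponents or special values):

```java
// Hypothetical sketch: parse a double from s[start, end) with no
// intermediate String allocation. Handles an optional sign, an integer
// part, and a fractional part only - no exponents, NaN, or infinities.
public class GarbageFreeParse {
    public static double parseDouble(CharSequence s, int start, int end) {
        int pos = start;
        boolean negative = false;
        char c = s.charAt(pos);
        if (c == '-' || c == '+') {
            negative = (c == '-');
            pos++;
        }
        long intPart = 0;
        while (pos < end && (c = s.charAt(pos)) != '.') {
            intPart = intPart * 10 + (c - '0'); // accumulate digits
            pos++;
        }
        double value = intPart;
        if (pos < end && s.charAt(pos) == '.') {
            pos++;
            long fracPart = 0;
            double divisor = 1;
            while (pos < end) {
                fracPart = fracPart * 10 + (s.charAt(pos) - '0');
                divisor *= 10;
                pos++;
            }
            value += fracPart / divisor; // may lose the last bit or two
        }
        return negative ? -value : value;
    }

    public static void main(String[] args) {
        CharSequence line = "1:0.5 2:-3.25";
        // parse "0.5" at [2, 5) and "-3.25" at [8, 13) without substrings
        System.out.println(parseDouble(line, 2, 5));   // 0.5
        System.out.println(parseDouble(line, 8, 13));  // -3.25
    }
}
```

The frac/divisor trick is also exactly why the result can differ from Double.parseDouble() in the last bits - dividing two doubles rounds.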

Here is a graph of the memory use of loading a file with the original LIBSVM code. The first graph just shows memory use. The second shows GC overhead and the number of collections.



Not horrible, but this was on a small data set. On larger ones (like the 100GB file I encountered the error with) this is a problem. Below are the same graphs for the new code. 





That looks much better! It also looks more like what some people would expect had the code been written in C/C++ instead of Java, which was one of the reasons I wanted to show it. The GC in Java is a great tool, but that doesn't mean we should always rely on it. When necessary, there isn't much stopping us from writing Java code that looks like C/C++ and behaves very similarly.

Currently I'm not super happy with the new code, as I wrote it in a hurry at home so I could use the updated code at work. If I go back and re-write it again, I'll probably make a small finite state machine so that the logic is easier to follow. But it does perform much better!
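A state machine for this could look something like the sketch below. The class and state names are hypothetical, and for brevity it uses substring() and the standard library parsers where the real code would use the garbage-free buffer walking described above - the point here is only how the states make one LIBSVM line ("label index:value index:value ...") easy to follow:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of an FSM scanning one LIBSVM line. Each state
// names what the current token is; transitions fire on the delimiter
// that ends that token (' ' or ':').
public class LibSvmLineScanner {
    enum State { LABEL, INDEX, VALUE }

    /** Scans "label idx:val ..." into {label, idx1, val1, idx2, val2, ...}. */
    public static List<Double> scan(String line) {
        List<Double> out = new ArrayList<>();
        State state = State.LABEL;
        int start = 0; // where the current token began
        for (int i = 0; i <= line.length(); i++) {
            // treat end-of-line as a trailing space so the last token closes
            char c = (i < line.length()) ? line.charAt(i) : ' ';
            switch (state) {
                case LABEL: // label ends at the first space
                    if (c == ' ') {
                        out.add(Double.parseDouble(line.substring(start, i)));
                        start = i + 1;
                        state = State.INDEX;
                    }
                    break;
                case INDEX: // feature index ends at ':'
                    if (c == ':') {
                        out.add((double) Integer.parseInt(line.substring(start, i)));
                        start = i + 1;
                        state = State.VALUE;
                    }
                    break;
                case VALUE: // feature value ends at a space (or end of line)
                    if (c == ' ') {
                        out.add(Double.parseDouble(line.substring(start, i)));
                        start = i + 1;
                        state = State.INDEX;
                    }
                    break;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(scan("+1 3:0.5 7:-2")); // [1.0, 3.0, 0.5, 7.0, -2.0]
    }
}
```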

I also had to implement my own parsing of a character sequence to a double to get this performance, and this has the unfortunate side effect that you don't necessarily get the exact same double value you would have gotten from the standard library. This is something I need to fix, and an artifact of how surprisingly involved dealing with floating point always is. However, the current code's relative error is always less than 10^-14, which means the ML code will all be fine. But it's still not a nice surprise.
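The bound being checked is just relative error against what Double.parseDouble() would produce; the simulated discrepancy below is hypothetical, only to show the shape of the comparison:

```java
// Relative error of an approximation against the "exact" parse.
public class RelErrorCheck {
    static double relativeError(double approx, double exact) {
        if (exact == 0)
            return Math.abs(approx); // avoid dividing by zero
        return Math.abs(approx - exact) / Math.abs(exact);
    }

    public static void main(String[] args) {
        double exact = Double.parseDouble("0.123456789012345");
        double approx = exact * (1 + 3e-15); // simulate a last-bits mismatch
        System.out.println(relativeError(approx, exact) < 1e-14); // true
    }
}
```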

I also want to take a moment and complain about ASCII data sets. For most machine learning data sets, a common binary representation would take up less space, load faster, and be easier to handle. I don't understand everyone's fascination with making things JSON or XML or whatever other text format they are using. It's almost always unneeded added complexity for the sole benefit of being "human readable", even when a human isn't ever going to read it!
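To make the size argument concrete, here is a rough comparison (illustrative names, not any real format): the same doubles written as text versus the fixed 8 bytes each of DataOutputStream's binary encoding, which also needs no parsing on the way back in:

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Illustrative sketch: text vs binary encoding of the same doubles.
public class BinaryVsText {
    static String asText(double[] vals) {
        StringBuilder sb = new StringBuilder();
        for (double v : vals)
            sb.append(v).append(' '); // up to ~25 chars per full-precision double
        return sb.toString();
    }

    static byte[] asBinary(double[] vals) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        for (double v : vals)
            out.writeDouble(v); // always exactly 8 bytes, no parse step to load
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        java.util.Random rng = new java.util.Random(42);
        double[] vals = new double[1000];
        for (int i = 0; i < vals.length; i++)
            vals[i] = rng.nextDouble();
        System.out.println("text bytes:   " + asText(vals).length());
        System.out.println("binary bytes: " + asBinary(vals).length); // 8000
    }
}
```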
