For the next update of JSAT (you can try them early in the 0.0.3-SNAPSHOT release), I've been working on implementing a few algorithms meant specifically for visualizing datasets. For now I've chosen three of the more common / well known algorithms, MDS, Isomap, and t-SNE. Unlike PCA, these algorithms are mostly meant to transform only one dataset at once - the intention being to visualize that block of data. You can't necessarily apply these algorithms to new data points, nor would you necessarily want to. Their purpose is really to help explore data and learn more.
A classic test problem is the "Swiss roll" dataset. This is a dataset where some of the data, such as the red and blue points below, are close based on the euclidean distance - but our intuition about the data is that they are really far away from each other in actuality. So a good 2D visualization of this data would place those data points far away from each other while keeping the red/orange close together.
t-SNE doesn't do a great job at visualizing the swiss roll data, but for more general datasets t-SNE is a great algorithm. The implementation in JSAT is the faster Barnes Hut approximation, so it runs in O(n log n) time, making it applicable to larger datasets. Below is a visualization of t-SNE on the MNIST dataset, which makes it easy to see 10 clusters, 1 for each datapoint. Looking at the relationships we can see stuff that makes sense. The 9 and 4 clusters are near each other, which makes sense - the noisy 9s and 4s could easily be misconstrued. We also see that some noisy 9s in the cluster of 7s, another understandable case.
t-SNE was a bit of work to implement, and I had difficulty replicating the results using the exact details in the paper. A few changes worked well for me, and had the benefit of being a bit easier to code. Right now my tSNE implementation is a slower single-threaded then the standard implementation, but mine does support multi-threaded execution.
Hopefully these new tools will be useful for people trying to get a better intuition of what's happening in their data. All three algorithms support multi-threaded execution and implement a new VisualizationTransform interface. The Isomap implementation also includes an extension call C-Isomap, and I'm hoping to add more useful visualization algorithms in the future.
On another note, I've moved the GUI components outside of JSAT completely. Anything GUI related will now be in a new project, JSATFX. As you might guess from the name, I'm going to be trying to do more of the code in JavaFX so I can learn it. The visualizations above are from the library.
No comments:
Post a Comment