So there is a really great paper called "Web-Scale K-Means Clustering" by D. Sculley out of Google, which introduces the Mini Batch k-Means algorithm. It's very clean and simple, something anyone could implement without even understanding it. However, there are a few very hand-wavy parts of the paper that I don't like, so I decided I would investigate them quickly. For this investigation, I used \(k = 100\) clusters on the MNIST data set. Like last time, I picked the initial means in a deterministic manner.
First is the issue of parameter selection. All algorithms tend to have this problem, but in Mini Batch not only do we need to figure out how many clusters \(k\) we need, but also how many iterations to perform and the batch size. The paper is loose on details, and simply says "many were tried, it's robust, here is the value we are using." Which is actually a little clever, because they avoid the iterations issue by just measuring how long it takes to reach a score relative to their benchmark.
When looking at this problem, though, it's important to note that the update weight for a cluster center is based on how many updates that center has received, so it is cluster-center dependent. If we do 50 iterations and have 100 clusters, some (well, most) of them won't get updated on each iteration. This is okay because of the per-center learning rate. But it also, to a large degree, makes the two variables interchangeable. Instead, we could look at the total number of updates performed, where \(\text{updates} = \text{iterations} \cdot \text{batch size}\).
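To make the per-center learning rate concrete, here is a minimal NumPy sketch of the update loop as I read it from Sculley's paper. This is not the implementation used for the experiments here, and details like sampling the mini-batch with replacement are my own simplifications:

```python
import numpy as np

def mini_batch_kmeans(X, centers, iterations, batch_size, rng=None):
    """Mini Batch k-Means update loop, roughly as described in the paper.

    X        : (n_samples, n_features) data matrix
    centers  : (k, n_features) initial means, modified in place
    Returns the updated centers.
    """
    rng = np.random.default_rng(rng)
    counts = np.zeros(len(centers))               # per-center update counts
    for _ in range(iterations):
        batch = X[rng.integers(0, len(X), size=batch_size)]
        # assign each batch point to its nearest center (cached for the batch)
        dists = ((batch[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        nearest = dists.argmin(axis=1)
        for x, c in zip(batch, nearest):
            counts[c] += 1
            eta = 1.0 / counts[c]                 # per-center learning rate
            centers[c] = (1.0 - eta) * centers[c] + eta * x
    return centers
```

Each center's step size shrinks only as that particular center accumulates updates, which is why skipping a center on some iterations does no harm.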
Clearly, updates is a very good measure of progress, but there is a lot of variance in the results of Mini Batch. If I plotted the error bars, you would see the standard deviations overlapping for all of them (and it would be very ugly and hard to read), so I haven't included them. Instead, the mean is what was plotted and used for all of these. So does the batch size versus iterations trade-off have any impact on the standard deviation of the results?
Turns out, still not really. Once we hit a batch size equal to the number of clusters, the standard deviation was pretty stable and starts to plateau very quickly. So in some ways this is a good mark in Mini Batch's favor: you just have to make sure you pick a batch size and number of iterations that give enough updates in total. So how many updates should we do? This is closely tied to my other issue, and that is speed.
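If you wanted to check this trade-off yourself, one way is to hold the total update budget fixed and sweep the batch size, scoring each run over several seeds. This builds on the `mini_batch_kmeans` sketch above, and `X` and `init_centers` are assumed to already hold the data and the deterministic initial means; the budget of 200,000 updates is a made-up value for illustration:

```python
import numpy as np

def kmeans_cost(X, centers):
    # total squared distance to the nearest center,
    # via ||x - c||^2 = ||x||^2 - 2 x.c + ||c||^2
    d2 = (X ** 2).sum(1)[:, None] - 2.0 * X @ centers.T + (centers ** 2).sum(1)[None, :]
    return d2.min(axis=1).sum()

total_updates = 200_000
for batch_size in (100, 500, 1000, 5000):
    iterations = total_updates // batch_size
    costs = [kmeans_cost(X, mini_batch_kmeans(X, init_centers.copy(),
                                              iterations, batch_size, rng=seed))
             for seed in range(10)]
    print(batch_size, iterations, np.mean(costs), np.std(costs))
```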
The paper claims that Mini Batch produces a much better solution than previous approximations while still being 2 to 4 orders of magnitude faster than regular k-means. The problem I have with this, as I alluded to in my last post, is that Lloyd's algorithm was their base case. So, below I plotted the time it took to get to solutions of varying quality over time (the stopping point was the number of iterations).
The plot is fairly similar to what was presented in the original paper, but there are two big differences. One, we can see the different batch sizes (with the same number of iterations), and they all tend to overlap based on the total number of updates. So again, it's fairly resilient against that choice. However, the exact k-means has shifted over to the left by 1 to 3 orders of magnitude! It turns out Mini Batch is about an order of magnitude faster at getting to a comparable solution than Elkan's algorithm. That is still very impressive, but it falls far short of the claims in Sculley's paper in my view.
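If you want to reproduce a rough version of this comparison yourself, scikit-learn ships both algorithms. Here is a sketch; the parameter values are placeholders rather than the settings behind the plots above, `X` is assumed to hold the MNIST vectors, and note that scikit-learn's `MiniBatchKMeans` counts `max_iter` as passes over the full data set rather than individual mini-batches:

```python
import time
from sklearn.cluster import KMeans, MiniBatchKMeans

def timed_fit(model, X):
    start = time.perf_counter()
    model.fit(X)
    return time.perf_counter() - start, model.inertia_

k = 100
elkan = KMeans(n_clusters=k, algorithm="elkan", n_init=1, max_iter=20)
# max_iter here is full passes over the data, not single mini-batches
mbk = MiniBatchKMeans(n_clusters=k, batch_size=1000, n_init=1, max_iter=5)

for name, model in [("elkan", elkan), ("mini-batch", mbk)]:
    seconds, inertia = timed_fit(model, X)
    print(f"{name}: {seconds:.1f}s, objective {inertia:.3e}")
```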
With this in mind, I want to draw attention to the following graph, where we plot the number of iterations against the time the algorithm took.
The first thing we can see is that all the batch sizes start at approximately the same time. This is because when we only do one iteration, the vast majority of the time goes to assigning all the data points to their closest final cluster afterwards. We can use this to choose a good initial batch size and number of iterations: if we make the number of updates equal to the number of data points, we spend about half our time updating centers and half our time assigning points. This is about where Elkan starts on the plot with one iteration. Looking at our graph, that would get us close to the best we can expect from Mini Batch. So that would be one not-bad heuristic to use, as in the small helper below.
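As a concrete (and purely illustrative) helper, the heuristic just amounts to solving for the iteration count given a batch size:

```python
import math

def heuristic_iterations(n_samples, batch_size):
    """Pick iterations so that iterations * batch_size ~= n_samples,
    i.e. the total number of updates matches the data set size."""
    return math.ceil(n_samples / batch_size)

# e.g. MNIST's 60,000 training points with a batch size of 1,000
print(heuristic_iterations(60_000, 1_000))   # -> 60 iterations
```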
The other thing to notice is the plateau of Elkan in the graph. As we get to the very end, Elkan's triangle inequality bounds are pruning out almost all of the data set, so only a few points get updated in each iteration, which is pretty cool I think.
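For anyone who hasn't seen it, the core pruning test really is just the triangle inequality: for a point \(x\) currently assigned to center \(c(x)\) and some other center \(c'\), if \(d(c(x), c') \ge 2\, d(x, c(x))\), then \(d(x, c') \ge d(c(x), c') - d(x, c(x)) \ge d(x, c(x))\), so \(c'\) can never steal \(x\) and the distance never needs to be computed. Near convergence the centers barely move, the bounds stay tight, and almost every point passes this test.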