Machine Learning with JSAT: Implementing Fast ML Algorithms

In my previous post, I talked about some advice for learning how to implement Machine Learning algorithms based on some questions from Jason Brownlee. The one question I didn't address was how to implement such algorithms efficiently. So in those post I'll try to go over some of the more common / reusable strategies for making an implementation fast. I've kept this separate because writing a fast implementation is often somewhat orthogonal to writing an understandable one.

For some of the math below, $\vec{x}$ will denote a vector x, $\vec{x}_i$ will denote the i'th value of the vector. Well assume we have $N$ data points of dimension $D$

Tricks for Sparse Data:

1. Sparse updates to weight vectors

Most bigger datasets I see are generally sparse, meaning most of the values are zero. Especially in NLP scenarios. If we attempt to learn a linear model (a weight vector $\vec{w}$, we will often see that we have to perform some scalar update $\vec{w} \gets \lambda \cdot \vec{w}$. Naively done, this means we have to do $O(D)$ work per update, even if there are only $S$ non zero elements in the input, where $S << D$.

Instead we can use the representation $\vec{w} = c \cdot \vec{v}$. Then, when we multiply $\vec{w}$ by $\lambda$ we simply update $c \gets c \cdot \lambda $. When we add some $\vec{x}$ to $\vec{w}$, we instead do $\vec{v} \gets \vec{v} + \frac{1}{c} \vec{x}$. This way we only have to touch the values of $\vec{w}$ for which there was a non zero in $\vec{x}$. It turns out that most of the operations on a vector can be handled easily in this form

$||\vec{w}||_p = c ||\vec{v}||_p $
$\vec{w}^T \vec{x} = c \cdot \vec{v}^T \vec{x} $
The mean and standard deviation are also just scaled by $c$.
The skewness and kurtosis of $\vec{w}$ are the same as $\vec{v}$, we we don't have to do anything there.

The other part of this trick that we we can't just update $c$ over and over again. Most likely we are multiplying by small values $\lambda < 1$. If we keep accumulating these, $c$ will get smaller and smaller until it reaches a numerically unstable range, where $c \cdot \vec{v}$ will underflow to zero or just become too inaccurate. It can also caused the update $\vec{v} = \vec{v} + \frac{1}{c} \vec{x}$ to cause NaN to show up in your weight vector. So in practice, you need to check that $c_{low} < c < c_{high}$. I use When the bounds are violated, rescale by setting $\vec{v} = c \cdot \vec{v}$ and $c = 1$.

In JSAT, I've placed all of this functionality in a ScaledVector class so that this trick can be used transparently. It also makes it easier to develop a new algorithm. I start out with regular vector objects. Once the implementation is working, I just have to change the constructor to switch to a sparse friendly version & test again for correctness. On rare occasion I've had a bug come up or discovered one after switching, but normally its just an easy adjustment. A similar trick exists for maintain the norm of a vec $||w||$, though I've not needed that one as often.

2. Update your sparse vectors at once.

When working with vectors, its not uncommon to see some unit of logic or code that loops over some indices in the vector and adjusts them or sets them to some value. If you aren't using logic to update every value in one fell swoop, its probably a performance bug. To understand why, you have to consider how sparse vectors are normally implement.

A sparse vector would keep track of the number of non zeros (or used) values, and have a larger capacity value. The capacity would be the length of two different arrays. One keeps track of the non-zero values, and the other array contains the 'true' index associated with each non-zero value. If the capacity is exceeded, the internal vectors would have to be copied to larger arrays - similar to how one could implement a hash table.

So lets say you loop over every index, check if it is non zero, and then set $\vec{x_i} \gets \log\left(\vec{x}_i+1\right)$. You would be making two sets of issues.

First, for each value we touch, we have to loop through all $O(S)$ non-zero values in our array to find out if the index is a non-zero. So we are actually doing $O(S)$ work to read/touch a value instead of $O(1)$ like with a dense vector.

Second, we should have some mechanism to simply loop over the non-zeros to begin with, instead we are checking all values - doing $O(D)$ instead of $O(S)$ checks.

Combined, that means we are doing $O(D S)$ work instead of $O(S)$. This is one of the ares were just using a Vector abstraction would bite you if you weren't aware of what was actually going on behind the scenes. The library you use should provide special methods for doing such operations. For example, the object could have a getNonZeroValues method that iterates over the non-zero values, and the iterator could have a special set method that knows which index in the arrays it is at, so that it can be updated to a new value easily. Note, that if the non-zero values are stored in sorted order, the work would be $O(D \log(S))$. However this has other ramifications, and normally $S$ is very small, so the overhead of a binary search might even be more expensive than just hitting every value.

3. Be extra careful when using sparse matrices.

Even if your tooling supports sparse matrices, don't assume that they are efficient. Many matrix operations will destroy sparsity, and produce dense results. Common tools - such as the Cholesky decomposition, require you to re-organize the columns or rows to have a sparse output - and even then the results might not be great. For some decompositions we don't even really know how to produce good sparse results in general.

So before you go out using sparse matrices, read the documentation and make sure it supports it - make sure you call any needed pre-processing steps like column reordering, and start with smaller tests to make sure it really does work well enough for your situation.

Tricks for General Data

1. If possible, decompose your problem into exact steps where you can add or remove points in an arbitrary order.

Now, when I say this - I don't mean decompose the whole algorithm (though if you can - awesome!). I'm referring to individual steps. This is because many algorithms update some intermediate components / solutions, but don't necessarily change everything at each step.

For example, the k-means algorithm does most of its updates in the first few iterations. The vast majority of data-points will not change ownership on every iteration, and thats work we could avoid doing.

Say $m^{(k)}$ is one of the means being created. It may be that only 5% of the points that were closest to that mean moved to another one. For simplicity, lets say that that 5% were replaced by new datums that changed to being closest to $m^{(k)}$. If we represent $m^{(k)} = \frac{1}{c} \cdot \vec{v}^{(k)}$, we can do easy exact updates & removals from the mean. We set $c = $ the number of datums in that mean. Then if $\vec{v}^{(k)} = \sum_{i = 1}^{c} \vec{x}^i$, we only have to subtract the 5% that changed ownership (and add the ones that changed to this mean). In this scenario, we have used the first sparse trick to reduce our work in updating the means to just 10% of the original amount! And it doesn't matter if the vectors are sparse or dense, we are saving a lot of work!

In reality, this isn't a fixed savings - the savings will depend on the dataset and algorithm its being used in. Its not changing the overall complexity of the algorithm either - its entirely possible that all of the datums will change. But most of the time implementing tricks like this makes a huge savings when applicable. This trick probably takes the most work on the implementer's part, as its not a simple case of recognizing a common pattern. It may take more thought / understand to realize that a step can be decomposed in such a way. More generally this trick can be described as "avoid redundant work", but its a particular kind of strategy to avoid work.

2. Don't compute euclidean distances the naive way.

I've written about this trick before. It amounts to the fact that we can write the euclidean distance in a simper form.

$$d(\vec{x}, \vec{y}) = ||\vec{x}-\vec{y}||_2 = \sqrt{ \vec{x}^T \vec{x} + \vec{y}^T \vec{y} - 2 \vec{x}^T \vec{y}}$$

This form is simply faster to compute because you can re-use information. If $\vec{x}$ is a datapoint that will stay in our dataset, or be re-used several times, we can just save $ \vec{x}^T\vec{x}$ and re-use it when the time comes. It also makes sparse or mixed sparse/dense euclidean distances easier to implement.

3. The exp, pow, and log functions are expensive!

Many people don't realize how expensive these basic functions are to compute until they are shown explicitly how much time can be saved. They aren't always the bottleneck when computing, but it is often the case that they (and similar functions) are a significant portion of runtime.

Once you have your code working, its probably worth seeing which operations you can replace with faster and less accurate approximations. For example, Platt's scaling is a calibration technique to add probabilities to SVMs. However it spends almost all of its time just computing the $\log$ of inputs. I was able to shave off 30-40% of Platt's runtime in my library with 0 degradation in accuracy by using an approximation. If I wanted to really work at it, I could work on writing a $\log$ approximation just for that class to shave off more time.

This trick in particular requires you to do some careful testing. You need to figure out which parts of the algorithm will still work with approximate results, how much accuracy you can give up, and the range of accuracy. If your code relies on values near an asymptote of a function, its going to be much harder to get a fast approximation that is still good enough to use. It may also be that some parts of the code needs the more exact solution, but the approximation is fine elsewhere.

Tricks for Memory and Cache performance

The final set of tricks are probably the most important. In modern CPUs, access to memory is far slower than any other operation you might be doing. If you don't have any prior experience in finding or debugging these kinds of performance issues, this may be the most frustrating. However is the biggest differentiator, and is applicable to all programming.

Use mutable vector operations. A common thing I see in python code especially is to create new vectors on every update. Just changing a w = w + x to w += x can be a massive speedup, as allocating an copying to a new vector is going to dominate the work. Especially if x is sparse.
Make sure you know if your code uses row major or column major ordering for arrays / matrices. Then store and access everything in the appropriate way when possible.
Programmatically obtain the L2 and/or L3 cache size for your hardware. If you do this you now have powerful information for doing work in groups/batches. This is another way to get massive speed improvements when data is going to be re-used. This is often more applicable in "divide-and-conquer" style algorithms. Once the sub-set size gets to the order of the L2 or L3 cache, you can then switch to code adapted to the fact that it knows the data fits in cache. This a common tactic for matrix algorithms often known as "blocking".
Copying data can be ok. Some novices at optimizing for memory might mistake this as being a bad idea - and it can be. However, there may be circumstances were code might be accessed in one way at one part of an algorithm or another way in a later part. Its also possible that you simply got data int he wrong row/column major ordering. While some libraries will happily take the data and just run slower, its can be faster to instead copy the data into a new area in the desired ordering if the data is going to be re-used multiple times.

Machine Learning with JSAT

Sunday, November 2, 2014

Implementing Fast ML Algorithms

No comments:

Post a Comment

About Me