Thursday, January 23, 2014

An Overview: Implementing the Digamma Function

I've been busy getting used to my full-time job, which has graciously given me permission to continue working on JSAT - so I haven't done much updating recently. I've been particularly neglectful of this blog. So to compensate, I've decided to do a bit of an overview on implementing a special function. In particular, how I came to the digamma function that I added to JSAT in anticipation of some other items on my TODO list.

I've always enjoyed numerical analysis and implementing functions, though implementing them is never something that was really "taught" in my numerical analysis courses. We learn to analyze them and determine the potential for errors and all that, but how do we actually take a function and implement it? So for this I'll walk through what I did to create mine. By no means am I an expert at this - I usually prefer to follow papers of known good implementations - but it is fun to do and sometimes necessary.

Besides the Wikipedia page, there are 2 other very useful websites when doing this kind of work. The first is obviously WolframAlpha. The other is also from Wolfram, but is the lesser-known functions.wolfram site. If you are approximating a well-known function (such as the digamma function) that is listed there, they have a ton of properties and series already listed.

First things first, we plot the function we are interested in to get an idea of what it's all about.

So, the first thing we notice is that everything \( x < 0 \) is clearly not a fun place to be. We would prefer to approximate areas that don't change much and don't change quickly. And if they do change, it would be nice if they changed in a consistent way. Occasionally we can get around such situations by looking for symmetry, a point where we can wrap a negative input value back into the positive space where things will be easier. This is where functions.wolfram comes into play.

If we pay attention, we realize that the left side kind of looks like \( \tan(x) \) put through a transformation. So if we were to approximate it directly, I'd want to try and find a set of extra terms that would make \( \tan(x) \) behave more like what we see - and then try to optimize those coefficients. But in this case we can use symmetry, since (from Wolfram)

$$ \psi(1-x) = \psi(x) + \frac{\pi}{\tan(\pi x)} $$

Using that we can shift any negative value of \( x \) into the positive range.

While we are shifting values, we can also see that for small positive values of \( x \) the function is rapidly approaching \( -\infty \). We again use another relation

$$ \psi(x) = \psi(x+1) - \frac{1}{x} $$

to shift away from the very small positive values. I decided to shift all values so that I only ever evaluate \( x \geq 2 \). This way I can focus on areas that appear to be much smoother and easier to deal with.
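As a sketch, the shifting step is just a loop that applies the recurrence and accumulates the \( -1/x \) corrections until the argument is large enough. (The names here are mine for illustration, not JSAT's actual code.)

```java
// Sketch of the argument-shifting step: psi(x) = psi(x+1) - 1/x is applied
// repeatedly until the argument reaches the target range, accumulating the
// corrections. Illustrative names, not JSAT's actual API.
public class DigammaShift
{
    /**
     * Returns {shiftedX, correction} such that psi(x) = psi(shiftedX) + correction,
     * with shiftedX >= 2.
     */
    public static double[] shiftUp(double x)
    {
        double correction = 0;
        while (x < 2)
        {
            correction -= 1.0 / x; // psi(x) = psi(x+1) - 1/x
            x += 1;
        }
        return new double[]{x, correction};
    }

    public static void main(String[] args)
    {
        double[] r = shiftUp(0.3);
        System.out.println("psi(0.3) = psi(" + r[0] + ") + " + r[1]);
    }
}
```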

Now begins the approximation part. My preferred method is to use a continued fraction, as they often have an excellent radius of convergence. If the terms are fairly simple to compute (or even better, are just coefficients), the continued fraction can also be very quick to compute. Unfortunately, WolframAlpha tells us the continued fraction for the digamma function involves evaluating the Riemann zeta function. So that would be too computationally expensive.

So we look for a series, and we note that the increase of \( \psi(x) \) is similar to that of \( \log(x) \). And indeed, we find the series expansion

$$ \psi(x) = \log(x) - \frac{1}{2 x} + \sum_{n = 1}^{\infty} \frac{\zeta(1-2 n)}{x^{2 n}} $$

Obviously we can't just keep summing until it's accurate enough, because that would require us to evaluate the Riemann zeta function. We could either try to find the number of terms needed to be accurate for all \( x > 2 \), or use a truncation of the series and try to correct for the error.

I chose the latter option, and truncated at

$$ \psi(x) \approx \log(x) - \frac{1}{2 x} - \frac{1}{12 x^2} $$

since this seemed to get me the most out of the diminishing returns in accuracy. In addition, I liked that it is actually a lower bound on \( \psi(x) \) for \( x > 0 \). At this point, the first thing I did was try and find the point at which the difference was small enough that I could just accept the approximation as close enough. By \( x = 500 \) the error was on the order of \( 10^{-15} \) or less, so that was my cutoff.
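The truncated expansion itself is a one-liner. A quick sanity check, independent of any reference values, is that the recurrence \( \psi(x+1) - \psi(x) = 1/x \) should hold almost exactly at large \( x \). (Again a sketch with names of my own choosing:)

```java
public class DigammaLarge
{
    // Truncated asymptotic series: psi(x) ~ log(x) - 1/(2x) - 1/(12 x^2).
    // Only intended for large x (the post accepts it directly for x >= 500,
    // with polynomial corrections below that).
    public static double psiLarge(double x)
    {
        return Math.log(x) - 1.0 / (2 * x) - 1.0 / (12 * x * x);
    }

    public static void main(String[] args)
    {
        // The recurrence psi(x+1) - psi(x) = 1/x should hold to near machine
        // precision at x = 500
        System.out.println((psiLarge(501) - psiLarge(500)) - 1.0 / 500);
    }
}
```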

That left me with the range \( x \in [2, 500] \) that I needed to approximate. Using some software to do MiniMax Approximations, I tried to accurately approximate as much of the range as I could with a relatively small polynomial correction term. I started with the end of the range at 500 and decreased \( x \) as much as I could. This is because \( \psi(x) \) behaves better and better as we increase \( x \), so I wanted to get as much of the easy range as possible before having to do something more convoluted or clever. I was able to get \( x \in [70, 500] \) with one polynomial and \( x \in (7, 70) \) with a second.

This left me with only \( x \in [2, 7] \) to get. I was aiming for high accuracy throughout all value ranges, and the code I was using for the MiniMax approximation wasn't working well in this range. I tried some simpler tricks like a Padé approximation centered at 4.5, but the radius of convergence wasn't quite large enough (as is often the case).

At this point I started looking around for a bit of help, and found this paper. For small positive values (in a different range) they used

$$ \frac{\psi(x)}{x-x_0} $$

where \( x_0 \) is the only positive value such that \( \psi(x_0) = 0 \). This ratio turns out to be easier to approximate. While the paper notes having to do extra work due to numerical stability issues, because I used the trick over a larger range than they did, I did not need to apply any extra work.
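For reference, the root is the known constant \( x_0 \approx 1.4616321449683623 \). A quick check that \( \psi \) really does vanish there can be done with a brute-force evaluator (repeated recurrence shifting up past 500, then the truncated series from earlier) standing in for the final implementation:

```java
public class DigammaRoot
{
    // Positive root of the digamma function, psi(x0) = 0 (known constant).
    public static final double X0 = 1.4616321449683623;

    // Brute-force digamma: shift the argument up to >= 500 with the
    // recurrence, then apply the truncated asymptotic series. Slow but
    // accurate; only for checking, not what the final implementation does.
    public static double digamma(double x)
    {
        double correction = 0;
        while (x < 500)
        {
            correction -= 1.0 / x; // psi(x) = psi(x+1) - 1/x
            x += 1;
        }
        return Math.log(x) - 1.0 / (2 * x) - 1.0 / (12 * x * x) + correction;
    }

    public static void main(String[] args)
    {
        System.out.println("psi(x0) = " + digamma(X0));
    }
}
```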

All of that work combined got me my digamma function. While I've not done any performance tests, accuracy-wise it seems to be very good (usually less than \( 10^{-14} \)) unless you evaluate near one of the asymptotes.
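To show the overall control flow of the pieces above - reflection for negative inputs, recurrence shifting, then the large-\( x \) series - here is a minimal end-to-end sketch. It substitutes brute-force shifting for the minimax polynomial corrections (whose coefficients I'm not reproducing here), so it is slow; the names and structure are mine, not the JSAT code.

```java
public class DigammaSketch
{
    /**
     * Digamma sketch: reflection for x < 0, recurrence shifting up to
     * x >= 500, then the truncated asymptotic series. The real implementation
     * replaces the long shifting loop with minimax polynomial corrections.
     */
    public static double digamma(double x)
    {
        if (x < 0) // reflection: psi(x) = psi(1 - x) - pi/tan(pi x)
            return digamma(1 - x) - Math.PI / Math.tan(Math.PI * x);
        double correction = 0;
        while (x < 500) // psi(x) = psi(x+1) - 1/x
        {
            correction -= 1.0 / x;
            x += 1;
        }
        // psi(x) ~ log(x) - 1/(2x) - 1/(12 x^2) for large x
        return Math.log(x) - 1.0 / (2 * x) - 1.0 / (12 * x * x) + correction;
    }

    public static void main(String[] args)
    {
        System.out.println(digamma(1.0));  // -gamma = -0.57721...
        System.out.println(digamma(-0.5)); //  2 - gamma - 2 ln 2 = 0.03648...
    }
}
```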

You may feel that parts of this were a bit hand-wavy. However, I feel most of the really difficult parts in implementing such functions are about having the intuition for the behavior: being able to spot a pattern, look for (or derive) a relation that will help you, or transform the function into a better-behaved form.