## Monday, May 6, 2013

### Approximate Maximal Margin Classification Algorithm

I've recently been adding a lot of online classifiers to JSAT, and was working on Approximate Maximal Margin Classification Algorithm (ALMAp):

Gentile, C. (2002). A New Approximate Maximal Margin Classification Algorithm. The Journal of Machine Learning Research, 2, 213–242.

The authors describe the algorithm once in very general terms - and then refine their specification as they go. Making it a little confusing to read (at least to me). It appears at first glance that ALMA has 5 parameters to set ($$p$$, $$q$$  $$\alpha$$  $$B$$  and $$C$$) ! But this is not the case. First is $$B$$ and $$C$$  which are set to default values that guarantee convergence ( $$B = \frac{1}{\alpha}$$, and $$C = \sqrt{2}$$ ).
There is also the scary mapping function which makes computation difficult and slow (one reason why SMIDAS in JSAT does not work well for large problems) - but we can fix the $$p$$ value to whatever we like, and $$p = 2$$ results in the map function becoming identity, which allows us to do simple updates. When we fix $$p = 2$$, this also fixes $$q = 2$$ , because $$q$$ is actually dependent upon $$p$$. This leaves only $$\alpha$$, which can be tested with only a few values from the paper.

The other sort of weirdness in the paper (I assume the authors realized it for their implementation, but they never mention it), is that the averaged output they use comes for (almost) free in the kernel case.

$$output_i(x) = \left(\sum_{k=1}^{m^{(i)}+1} c_k^{(i)} w_k^{(i)} \right) \cdot x$$

which can be just as easily written as

$$output_i(x) = \sum_{k=1}^{m^{(i)}+1} c_k^{(i)} \left( w_k^{(i)} \cdot x \right)$$

This second form makes far more sense when implementing the averages output of the kernelized version, because it means we don't have to figure out how to averages the kernel spaces.

Then in Remark 5, we get the recursive definition of the current hyperplane (and another recursive definition to get us $$N_{k+1}$$ )

$$w_{k+1} \cdot x = \frac{w_k \cdot x + \eta_k y_t \hat{x}_t \cdot x}{N_{k+1}}$$

The authors do mention that this gets us the kernel form since it is now expressing the hyperplane in dot products, they don't mention that this gets us the averaged (or voted) output result for free! We simply have to tally the averaged result as we compute every $$w_k \cdot x$$ in the recurrence relation. Thus saving us from having to keep any copies (linear case you have to keep and extra $${w_k}_{avg}$$ around).

The paper defines $$\hat{x}_t = \frac{x_t}{\sqrt{x_t \cdot x_t}}$$, which gives the wrong impression that you would need to store a modified copy of the input vector - instead you can write

$$w_{k+1} \cdot x = \frac{w_k \cdot x + \eta_k y_t \frac{1}{\sqrt{x_t \cdot x_t}} x_t \cdot x}{N_{k+1}}$$

and then cache the self dot products for the future.

Overall its a good paper and nice algorithm, though I'm not sure if I want to implement the generalized $$p$$ version for the linear case. You can find the linear and kernelized version of ALMA2 in JSAT since revision 692.