Sunday, July 1, 2012

What is a Probability Density Function, and what is it good for?

A lot of Machine Learning is about estimating the probability of classes. But first, how do we describe the probability of something? I'll talk about this in fairly general terms, skipping over a lot of rigorous math. This is mostly because I want to introduce the concept and some high-level ideas.

So, how do we represent the probability of $$x$$ occurring? We denote it $$P(x)$$, where the 'P' is for probability. The probability could be for something discrete (where $$x$$ can take only one of a finite number of values), or continuous. Flipping a coin has only two possibilities - heads or tails [or perhaps it has three: heads, tails, and landing on its side]. One would denote the probability of a coin flip being 'Heads' as $$P(x = Heads)$$.

While discrete distributions are useful, I find continuous probability distributions more interesting and useful for ML. As the name implies, continuous distributions represent the probability of $$x$$ having any value in a given range. Often, this range is $$(-\infty , \infty)$$. So say we are trying to estimate the probability of a bright-eyed young college graduate getting a $60k job. We would write $$P(x = 60,000)$$. However, this does not make as much sense as we might think. We will get to why in a moment. I like to describe a probability distribution by its probability density function, which we denote $$f(x)$$. It is tempting to say $$P(x) = f(x)$$. Seems simple, but wouldn't we expect $$P(x = 60,000) \approx P(x = 60,001)$$? Say there is a 1% chance of getting a $60k job. By our intuition, $60,001 would also be around 1%, and so on. But say we incremented by even less, one cent at a time. Do you see the problem? The sum of probabilities for just a range of a few dollars would be over 100%!
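To make the arithmetic concrete, here is a tiny sketch. The 1% figure is the made-up number from the paragraph above, and `naive_total` is a hypothetical helper that just counts salary values in a range and gives each one the same naive probability.

```python
# Hypothetical: pretend every individual salary value near $60,000 has a 1% chance.
NAIVE_PROBABILITY = 0.01


def naive_total(range_cents, step_cents):
    """Add up the naive per-value probability over a salary range.

    Working in whole cents keeps the counting exact.
    """
    n_values = range_cents // step_cents + 1  # both endpoints are included
    return n_values * NAIVE_PROBABILITY


# A $2 range in $1 steps: $60,000, $60,001, $60,002.
per_dollar = naive_total(200, 100)
# The same $2 range in 1-cent steps: 201 distinct values.
per_cent = naive_total(200, 1)

print(f"stepping by dollars: {per_dollar:.2f}")
print(f"stepping by cents:   {per_cent:.2f}")
```

Stepping by cents, a $2 range already adds up to more than 1, which cannot be a probability.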

So really, you can't use $$P(x)$$ as a probability when $$x \in \mathbb{R}$$: the probability of any single exact value is zero. But you can still use the density to compare which is more likely, $$x$$ or $$x + \epsilon$$, even if it's only going to be a small difference. Which is useful.

But back to the probability density function: what is it good for? Well, it turns out we can reasonably ask for the probability that we get a job offer in the range of 55k to 65k, or $$P(55k \leq x \leq 65k)$$.

Because we have the density function, and by definition its integral over the whole range must equal one, we can write
$$P(a \leq x \leq b) = \int_{a}^{b} f(x) \, dx$$
Now this is meaningful! We can ask for the probability of getting a job offer in a certain range, and the answer is a real probability between 0 and 1.
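Here is a minimal sketch of that integral in Python. The salary model is purely an assumption for illustration: I pretend offers follow a normal distribution with a mean of $60k and a standard deviation of $10k. `prob_between` is a hypothetical helper that approximates the integral with a midpoint sum, and the result is checked against the closed-form CDF from the standard library's `statistics.NormalDist`.

```python
from statistics import NormalDist

# Assumed salary model (not from real data): Normal(mean=$60k, sd=$10k).
offers = NormalDist(mu=60_000, sigma=10_000)


def prob_between(dist, a, b, steps=10_000):
    """Approximate the integral of dist.pdf from a to b with a midpoint sum."""
    width = (b - a) / steps
    return sum(dist.pdf(a + (i + 0.5) * width) for i in range(steps)) * width


# P(55k <= x <= 65k) by integrating the density numerically...
p_numeric = prob_between(offers, 55_000, 65_000)
# ...and the same quantity from the cumulative distribution function.
p_exact = offers.cdf(65_000) - offers.cdf(55_000)

print(f"numeric: {p_numeric:.4f}, exact: {p_exact:.4f}")
```

Under this assumed model the answer comes out to roughly 0.38, so a bit over a third of offers land between 55k and 65k.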

For example, say it takes $$Z$$ dollars a year to live a life where you can afford food, shelter, and medical care. We can then use our density function and find
$$P(x \leq Z) = \int_{-\infty}^{Z} f(x) \, dx$$
This is the probability of a job offer $$x$$ being less than the amount of money an individual would need to afford basic needs. This is important information, and the probability of this happening is something we would want to minimize. Now that we can ask for it, we can target it and monitor changes in the value.
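As a quick sketch, this one-sided probability is just the cumulative distribution function evaluated at $$Z$$. Both the normal salary model and the value of `Z` below are made-up numbers for illustration.

```python
from statistics import NormalDist

# Assumed salary model (not from real data): Normal(mean=$60k, sd=$10k).
offers = NormalDist(mu=60_000, sigma=10_000)

# Hypothetical cost of basic needs per year.
Z = 40_000

# P(x <= Z): the integral of the density from -infinity up to Z.
p_below_need = offers.cdf(Z)

print(f"P(x <= Z) = {p_below_need:.4f}")
```

With these assumed numbers, $$Z$$ sits two standard deviations below the mean, so the probability is a little over 2%.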

We can also alter this and ask: what is the mean job offer? We essentially need to average over all possible values, weighting each by its density, which can be written as
$$Mean = \int_{-\infty}^{\infty} x \cdot f(x) \, dx$$
We may also want to know the median, which is more resilient to outliers. In this case, we can solve the following for $$Median$$:
$$\int_{-\infty}^{Median} f(x) \, dx = \frac{1}{2}$$
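To see both formulas in action, and to see why the median is less bothered by outliers, here is a rough sketch. The density is an invented mixture: 90% of offers come from a "typical" normal distribution around $60k and 10% from an "outlier" group around $200k. All of the numbers, and the helper names, are assumptions for illustration only.

```python
from statistics import NormalDist

# Invented mixture density: mostly typical offers, plus a few very large ones.
typical = NormalDist(mu=60_000, sigma=10_000)
outliers = NormalDist(mu=200_000, sigma=20_000)
W_TYPICAL, W_OUTLIER = 0.9, 0.1

# Integration limits wide enough that the density outside them is negligible.
LOW, HIGH = -40_000, 400_000


def f(x):
    """Mixture probability density function."""
    return W_TYPICAL * typical.pdf(x) + W_OUTLIER * outliers.pdf(x)


def cdf(x):
    """Integral of f from -infinity to x."""
    return W_TYPICAL * typical.cdf(x) + W_OUTLIER * outliers.cdf(x)


def mean_by_integration(steps=20_000):
    """Approximate the integral of x * f(x) with a midpoint sum."""
    width = (HIGH - LOW) / steps
    total = 0.0
    for i in range(steps):
        x = LOW + (i + 0.5) * width
        total += x * f(x)
    return total * width


def median_by_bisection(iterations=100):
    """Find the point where the integral of f reaches 1/2."""
    lo, hi = LOW, HIGH
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if cdf(mid) < 0.5:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


mean_offer = mean_by_integration()
median_offer = median_by_bisection()

print(f"mean:   {mean_offer:,.0f}")
print(f"median: {median_offer:,.0f}")
```

The small group of huge offers drags the mean up to about $74,000 (0.9 x 60k + 0.1 x 200k), while the median stays a little above $61,000, close to what a typical graduate sees.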

This can also be useful in ML. Say we want to try to identify someone as a Computer Science graduate or a Philosophy graduate based on the value of their job offer. One way to do this would be to compare their probabilities. If $$P_{CS}(x) > P_{P}(x)$$, then we would say it's more likely they were a Computer Science major.
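A minimal sketch of this first classifier follows. The two salary distributions are invented for the example (they are not real statistics about either major), and `classify_by_density` is a hypothetical helper name.

```python
from statistics import NormalDist

# Invented per-major salary models, for illustration only.
cs_offers = NormalDist(mu=70_000, sigma=12_000)
philosophy_offers = NormalDist(mu=45_000, sigma=10_000)


def classify_by_density(x):
    """Pick the major whose density is larger at the offer value x."""
    if cs_offers.pdf(x) > philosophy_offers.pdf(x):
        return "Computer Science"
    return "Philosophy"


for offer in (40_000, 75_000):
    print(f"{offer:,} -> {classify_by_density(offer)}")
```

Note that each classification needs two density evaluations, which matters if the density is expensive to compute.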

Another option would be to take the means of each distribution, getting $$Mean_{CS}$$ and $$Mean_{P}$$. If a job offer was closer to the mean of a philosophy major, then we would say they were probably a Philosophy major. We would check which distance is smaller, $$|Mean_{CS}-x|$$ or $$|Mean_{P}-x|$$. The same method could be done using the median instead.
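The nearest-mean version is even shorter. Again, the two means below are invented numbers, and `classify_by_mean` is a hypothetical helper name.

```python
# Invented mean offers per major, for illustration only.
MEAN_CS = 70_000
MEAN_PHILOSOPHY = 45_000


def classify_by_mean(x):
    """Pick the major whose mean offer is closest to x."""
    if abs(MEAN_CS - x) < abs(MEAN_PHILOSOPHY - x):
        return "Computer Science"
    return "Philosophy"


for offer in (50_000, 60_000):
    print(f"{offer:,} -> {classify_by_mean(offer)}")
```

Once the means are known, the density function is never evaluated at classification time: it is just two subtractions and a comparison.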

So cool, now we can classify things using the probability density function $$f(x)$$! But at no point have I said how we find $$f(x)$$! Finding that function, or approximating it, is basically what a lot of Machine Learning is about, and we'll talk about that more later.

How we use the pdf depends on what our goals are. If computing $$P(x)$$ is expensive, the first method of just comparing probabilities won't work well if we have to do thousands of classifications every minute. If we used the second method, which only needs 2 subtractions, we know it will be really fast - but it might not be as accurate. It's a trade-off, which is another topic that will be discussed.