Friday, June 29, 2012

What is Machine Learning?

So what is Machine Learning? It's a prerequisite question if you plan on using it, and one that gets tangled up with buzzwords such as "big data". I like to describe it as the intersection of Artificial Intelligence and Statistics. 


AI is trying to get computers to make smart decisions. This is different from what many people think, that AI wants to get computers to think like humans. No such goal really exists, because humans do stupid things. No banker wants an AI system buying thousands of lottery tickets as an investment. Statistics is the science of analyzing and interpreting data. 


So, ML is their intersection: we want to get computers to make smart decisions (or tell us something useful) by providing them with data. Not just any data, but data about the thing we want to make a decision about. The data comes in the form of a data set, which is made up of many data points. But what is in a data point?


A data point can have categorical values. For example, a car can have either blue, red, black, or white paint. A data point can also have numerical values; a car's engine has some number of horsepower and a certain size in liters. All data points in a data set are consistent: they each have the same types of values in the same order. 
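
To make that concrete, here is a minimal sketch in plain Java of what such a data set could look like. The names `CarData`, `Paint`, and `Car` are made up for illustration and the numbers are invented; this is not JSAT's API.

```java
import java.util.List;

// Hypothetical names for illustration only; this is not JSAT's API.
class CarData {
    // A categorical value: one of a fixed set of options.
    enum Paint { BLUE, RED, BLACK, WHITE }

    // One data point: every car has the same values, in the same order.
    static final class Car {
        final Paint paint;        // categorical
        final double horsepower;  // numerical
        final double liters;      // numerical (engine size)

        Car(Paint paint, double horsepower, double liters) {
            this.paint = paint;
            this.horsepower = horsepower;
            this.liters = liters;
        }
    }

    // A data set is just many data points. These numbers are invented.
    static List<Car> exampleDataSet() {
        return List.of(
            new Car(Paint.RED, 150.0, 2.0),
            new Car(Paint.BLUE, 300.0, 3.5),
            new Car(Paint.WHITE, 110.0, 1.6));
    }

    public static void main(String[] args) {
        for (Car c : exampleDataSet()) {
            System.out.println(c.paint + ": " + c.horsepower + " hp, " + c.liters + " L");
        }
    }
}
```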


Many algorithms work especially well for data sets that have all numerical values. In this case, we often denote each data point \( x \) in a data set as a vector, \(\vec{x} = \{x_0, x_1, \ldots, x_{n-1} \}\), where \(n\) is the number of values that make up a data point. But data points can have any mix of numeric and categorical values, and you can convert between the two (at some cost).
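
One common way to turn a categorical value into numbers is one-hot encoding: each category gets its own 0/1 slot in the vector. That is my choice for this sketch, not the only option, and the class and method names are hypothetical. The cost shows up right away: one paint value becomes four numbers.

```java
import java.util.Arrays;

// Hypothetical helper for illustration only; this is not JSAT's API.
class OneHot {
    // Turn category number `category` (out of `numCategories`) into a 0/1 vector.
    static double[] encode(int category, int numCategories) {
        double[] v = new double[numCategories];
        v[category] = 1.0;
        return v;
    }

    // Build an all-numerical data point: the encoded paint, then the numeric values.
    static double[] toVector(int paint, int numPaints, double horsepower, double liters) {
        double[] x = Arrays.copyOf(encode(paint, numPaints), numPaints + 2);
        x[numPaints] = horsepower;
        x[numPaints + 1] = liters;
        return x;
    }

    public static void main(String[] args) {
        // Paint number 1 of 4 (say, red), 150 hp, 2.0 liters.
        System.out.println(Arrays.toString(toVector(1, 4, 150.0, 2.0)));
        // prints [0.0, 1.0, 0.0, 0.0, 150.0, 2.0]
    }
}
```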


The three ML problems I will be talking about are Classification, Regression, and Clustering. 


Classification is the task of assigning a category to an example. Spam filters are a classification problem. An email can be undesirable (spam), or desirable (called ham, because ham is better than spam!). In these problems we know all of the categories before we begin, and we provide prior examples of the problem where we know the answer. That means we first train our spam filter using lots of examples of spam emails and ham emails. We then use it after it's been trained. 
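
Here is a toy sketch of that train-then-use pattern. It just counts how often each word showed up in spam versus ham and lets the words of a new email vote. Real spam filters are a lot smarter than this; the class name and the example emails are made up.

```java
import java.util.HashMap;
import java.util.Map;

// A toy word-vote classifier for illustration; not a real spam filter.
class ToySpamFilter {
    // For each word: (times seen in spam) minus (times seen in ham).
    private final Map<String, Integer> score = new HashMap<>();

    // Training: learn from an example whose answer we already know.
    void train(String email, boolean isSpam) {
        for (String word : email.toLowerCase().split("\\s+")) {
            score.merge(word, isSpam ? 1 : -1, Integer::sum);
        }
    }

    // Use: add up the scores of the words in a new, unlabeled email.
    boolean isSpam(String email) {
        int total = 0;
        for (String word : email.toLowerCase().split("\\s+")) {
            total += score.getOrDefault(word, 0);
        }
        return total > 0;
    }

    public static void main(String[] args) {
        ToySpamFilter filter = new ToySpamFilter();
        filter.train("win free money now", true);
        filter.train("free prize claim now", true);
        filter.train("lunch meeting at noon", false);
        filter.train("notes from the meeting", false);

        System.out.println(filter.isSpam("claim your free money"));  // true
        System.out.println(filter.isSpam("meeting notes attached")); // false
    }
}
```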


Regression is similar to classification, but instead we have a numerical value we wish to predict. For example, a car manufacturer might want to use the horsepower, liter size, and car weight to predict the miles per gallon a car will get. 
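
As a rough sketch, here is about the simplest regression there is: fitting a straight line by least squares. To keep it short I only use one input (car weight) instead of all three, and the numbers are invented. The class name is hypothetical.

```java
// A one-input least-squares line fit, for illustration only.
class SimpleLinearRegression {
    final double slope;
    final double intercept;

    private SimpleLinearRegression(double slope, double intercept) {
        this.slope = slope;
        this.intercept = intercept;
    }

    // Training: find the line y = intercept + slope * x with the least squared error.
    static SimpleLinearRegression fit(double[] x, double[] y) {
        double meanX = 0, meanY = 0;
        for (int i = 0; i < x.length; i++) {
            meanX += x[i] / x.length;
            meanY += y[i] / y.length;
        }
        double sxy = 0, sxx = 0;
        for (int i = 0; i < x.length; i++) {
            sxy += (x[i] - meanX) * (y[i] - meanY);
            sxx += (x[i] - meanX) * (x[i] - meanX);
        }
        double slope = sxy / sxx;
        return new SimpleLinearRegression(slope, meanY - slope * meanX);
    }

    // Use: predict a number for an input we have not seen.
    double predict(double x) {
        return intercept + slope * x;
    }

    public static void main(String[] args) {
        double[] weightLbs = {2000, 3000, 4000}; // invented data
        double[] mpg = {40, 30, 20};
        SimpleLinearRegression model = fit(weightLbs, mpg);
        System.out.println(model.predict(2500));
    }
}
```

The made-up points sit exactly on a line, so the prediction for a 2500 lb car lands halfway between 40 and 30 mpg, up to floating point rounding.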


Classification and Regression problems are closely related, and you can actually convert classification algorithms into regression algorithms, and regression algorithms into classifiers. But this does not come for free, and has various trade-offs. Many algorithms also support both intrinsically, and do not have to be rewritten or converted. 
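
One simple conversion, sketched here under my own assumptions: take a regressor that was trained on targets of 0 and 1, and call anything at or above a cutoff the positive class. The wrapper name is hypothetical, and the "regressor" below is a hand-written stand-in rather than a trained model.

```java
import java.util.function.Predicate;
import java.util.function.ToDoubleFunction;

// Hypothetical wrapper for illustration only; this is not JSAT's API.
class RegressionAsClassifier {
    // Turn any number-predicting function into a yes/no classifier.
    static <T> Predicate<T> threshold(ToDoubleFunction<T> regressor, double cutoff) {
        return x -> regressor.applyAsDouble(x) >= cutoff;
    }

    public static void main(String[] args) {
        // Stand-in for a trained regressor: maps car weight to a rough 0..1 score.
        ToDoubleFunction<Double> score = weight -> (weight - 2000.0) / 2000.0;
        Predicate<Double> isHeavy = threshold(score, 0.5);

        System.out.println(isHeavy.test(3500.0)); // true  (score 0.75)
        System.out.println(isHeavy.test(2500.0)); // false (score 0.25)
    }
}
```

One of the trade-offs is visible here: the classifier throws away how far the score was from the cutoff, so a barely-yes and a definitely-yes look the same.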


The last problem is Clustering. The first two problems are supervised problems, meaning a human has to provide the correct answers (labels) for the training examples. Clustering is unsupervised: we give the algorithm no human-provided labels. The goal is not to predict anything, but to discover something we don't know about the data. For example, a company might use clustering to discover groups of similar customers. They could then produce marketing aimed at a specific subset of their customers. 
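
As a small sketch, here is k-means, one well-known clustering algorithm (my pick for the example; there are many others), run on a single invented number per customer: yearly spending. Notice that no labels go in. The class name and the starting guess for the centers are assumptions made for illustration.

```java
import java.util.Arrays;

// A bare-bones one-dimensional k-means, for illustration only.
class TinyKMeans {
    // Returns the cluster number assigned to each value.
    // Note: `centers` holds the starting guesses and is updated in place.
    static int[] cluster(double[] values, double[] centers, int iterations) {
        int[] assignment = new int[values.length];
        for (int it = 0; it < iterations; it++) {
            // Step 1: assign each point to its nearest center.
            for (int i = 0; i < values.length; i++) {
                int best = 0;
                for (int c = 1; c < centers.length; c++) {
                    if (Math.abs(values[i] - centers[c]) < Math.abs(values[i] - centers[best])) {
                        best = c;
                    }
                }
                assignment[i] = best;
            }
            // Step 2: move each center to the mean of the points assigned to it.
            for (int c = 0; c < centers.length; c++) {
                double sum = 0;
                int count = 0;
                for (int i = 0; i < values.length; i++) {
                    if (assignment[i] == c) {
                        sum += values[i];
                        count++;
                    }
                }
                if (count > 0) {
                    centers[c] = sum / count;
                }
            }
        }
        return assignment;
    }

    public static void main(String[] args) {
        double[] yearlySpend = {20, 25, 30, 400, 420, 450}; // invented data
        double[] centers = {20, 450};                       // crude starting guess
        System.out.println(Arrays.toString(cluster(yearlySpend, centers, 10)));
        // prints [0, 0, 0, 1, 1, 1]
    }
}
```

The algorithm never learns what the two groups mean. It only reports that there seem to be small spenders and big spenders, and it is up to a human to decide what to do with that.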

Thursday, June 28, 2012

What on earth is this about?

So I'm starting a blog. What's it about? It is about the Java Statistical Analysis Tool (JSAT), a library I'm writing in my free time for fun and to teach myself. It's constantly evolving, has lots of design mistakes in it, and will eventually change several times. There are also several parts where I should have just used a pre-made library, but I wrote them myself just for learning purposes. It's a bad habit, but oh well.

But I'm hoping it will be helpful for people. Currently, I'm focusing it on Machine Learning (ML), specifically Classification, Regression, and Clustering.

So what will I put here that you want to read? Hopefully everything, but that's unlikely. I'll be putting up a number of introductory and high-level posts, intended for people who are technical in general but not in ML. I'll also post about algorithms and methods I have implemented or am implementing, often in the context of using them in JSAT.