A Simple Approach to Binary Classification

The Basic Idea

We want to make a classification decision -- class 1 or class 2? -- based on some evidence \(x\).

The evidence is often ambiguous, and we'd like to be able to estimate its meaning from a sample of training data. So we pose the problem in terms of probabilities: pick the class \(c_i\) (i.e. \(c_1\) or \(c_2\)) so as to maximize the probability of the class \(c\) given the evidence \(x\): $$Class(x) = \left\{ \begin{array}{l l} c_1 & \quad \text{if } P ( c_1|x) \gt P(c_2 | x)\\ c_2 & \quad \text{if }P(c_2|x) \gt P(c_1|x)\end{array} \right. $$

As we discussed earlier, it's usually easier to turn this around and estimate the probability of the evidence \(x\) given the class \(c\), and then to use Bayes' Rule to derive the quantity we want: $$P(c|x) = \frac{P(x|c)P(c)}{P(x)}$$

So we're going to choose class \(c_1\) if $$P(c_1|x) \gt P(c_2|x)$$ By Bayes' Rule, this is $$\frac{P(x|c_1)P(c_1)}{P(x)} \gt \frac{P(x|c_2)P(c_2)}{P(x)}$$ Multiplying both sides by \(P(x)\), this becomes $$P(x|c_1)P(c_1) \gt P(x|c_2)P(c_2)$$ And dividing both sides by \(P(x|c_2)P(c_1)\) to separate the priors, we get $$\frac{P(x|c_1)}{P(x|c_2)} \gt \frac{P(c_2)}{P(c_1)}$$

If the two classes are equally likely, or at least if we don't know anything to make us expect one of them to be more probable than the other, this becomes $$\frac{P(x|c_1)}{P(x|c_2)} \gt 1$$ And taking the log of both sides makes our scores symmetrical around zero, and gives us what is known as a "log likelihood ratio" on the left-hand side: $$log(\frac{P(x|c_1)}{P(x|c_2)}) \gt 0$$

(Of course, if we expect one class to be more common than the other, regardless of the evidence, the right-hand side of the inequality becomes \(log(\frac{P(c_2)}{P(c_1)})\) rather than zero -- a positive or negative bias, depending on which class is more probable a priori.)
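
Here's a minimal sketch of that decision rule in Python; the likelihood functions are hypothetical stand-ins for whatever model supplies \(P(x|c_1)\) and \(P(x|c_2)\):

```python
import math

def classify(x, likelihood_c1, likelihood_c2, prior_c1=0.5, prior_c2=0.5):
    """Two-class decision by log likelihood ratio.

    likelihood_c1 and likelihood_c2 are hypothetical callables returning
    P(x|c1) and P(x|c2) for a piece of evidence x.
    """
    # Left-hand side: log(P(x|c1) / P(x|c2))
    llr = math.log(likelihood_c1(x)) - math.log(likelihood_c2(x))
    # Right-hand side: log(P(c2) / P(c1)) -- zero when the priors are equal
    threshold = math.log(prior_c2) - math.log(prior_c1)
    return "c1" if llr > threshold else "c2"
```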

How To Combine Multiple Pieces of Evidence?

Usually the "evidence" \(x\) is a combination of multiple features, maybe a very large number of them. We therefore represent \(x\) as a vector whose elements might have quantitative values (like the brightness of a pixel), or nominal values (like letters of the alphabet), or boolean values (like the presence or absence of a particular word).

If we take \(P(x|c)\) to refer to the joint distribution of all of the values of all of the elements of a vector \(x\), the problem of estimating \(P\) becomes very hard -- in fact it's usually intractable. So we make some simplifying assumptions. The most radical simplifying assumption is that all the elements of \(x\) are independent of one another.
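
Under that assumption, the joint probability factors into a product over the elements \(x_1, \ldots, x_n\) of the vector, and the log likelihood ratio becomes a simple sum of per-feature log ratios: $$P(x|c) = \prod_{i=1}^{n} P(x_i|c) \quad \Rightarrow \quad log(\frac{P(x|c_1)}{P(x|c_2)}) = \sum_{i=1}^{n} log(\frac{P(x_i|c_1)}{P(x_i|c_2)})$$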

This assumption is almost always false. It's so obviously a gross oversimplification that the evaluative adjective "naive" is appropriate. And since this method uses Bayes' Rule in order to turn estimates of \(P(x|c)\) into estimates of \(P(c|x)\), its usual name is "naive bayes". If you search Google Scholar for "naive bayes", you'll see that versions of this method are very widely used.

A Simple Example: Language Identification, One Letter At A Time

Let's explore these ideas in a concrete example. Our task is to classify a sample of text as either English or Italian. Our feature vector \(x\) will be the sequence of letters in the text. To make life easier, we'll turn all the letters to lower case, and ignore everything except for the letters a to z. And the "naive" independence assumption here is especially moronic -- we'll ignore the order of the letters, and just consider for each letter the estimated probability that it would arise in an English text vs. in an Italian text.

No one in their right mind would use this method of text classification. A plausible method would be to use dictionaries of word forms in different languages, or perhaps estimated probability of three- or four-letter sequences. (The combination of those features would probably still assume independence, however...)

But as we'll see, this moronic method works surprisingly well.

To get our feature vectors, we'll take a recent story from the New York Times:

A top aide to a Republican congressman from Arizona helped promote a legislative plan to overhaul the nation's home mortgage finance system. [...etc...]

Or one from La Stampa:

La presidente della Camera in tv commenta gli incidenti durante il voto del dl Imu-Bankitalia: «Queste cose si sono viste solo in dittatura, riflettiamoci» [...etc...]

After monocasing and so on, the first thousand characters of each text, represented as a column vector, are here and here.
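
(A minimal sketch of that preprocessing in Python -- just lower-casing the text and keeping the letters a to z:)

```python
def normalize(text):
    """Lower-case the text and keep only the letters a to z."""
    return [ch for ch in text.lower() if 'a' <= ch <= 'z']
```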

As an estimate of the probability of letter given language, we'll use a set of counts from a few 2006 NYT stories and a set of counts from a few 2006 La Stampa stories, with 20,000 characters per sample.
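
As a sketch of how those estimates might be computed from a training sample (the smoothing constant here is an assumption of mine, just to keep a zero count from producing log(0)):

```python
from collections import Counter

LETTERS = "abcdefghijklmnopqrstuvwxyz"

def letter_probs(training_text, smoothing=0.5):
    """Estimate P(letter|language) as smoothed relative frequencies
    of a..z in one language's training sample."""
    counts = Counter(ch for ch in training_text.lower() if 'a' <= ch <= 'z')
    total = sum(counts[letter] for letter in LETTERS) + smoothing * len(LETTERS)
    return {letter: (counts[letter] + smoothing) / total for letter in LETTERS}
```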

At this point, let's remind ourselves of the benefits of Bayes' Rule. Suppose we have a table of letter frequencies, with one row per language and one column per letter -- say \(N\) languages and \(M\) letters:

                a    b    c    ...    letter #M
English
Italian
Somali
...
Language #N

In order to estimate \(P(letter|English)\), all we need to know is the row containing the English letter counts. In order to estimate \(P(English|letter)\), we'd need to know the column containing the language counts for the letter in question. If our language universe is limited to English and Italian, we would have what we need; but in a more realistic situation, we might be adding new languages all the time, and also improving our models for old languages. (And if we are using more plausible features, like words or letter ngrams, we'll need to use complex smoothing techniques to avoid sampling error due to small counts in the typical Large Number of Rare Events situation...) Using the Bayes' Rule method, we can treat each language model by itself, and combine them in a simple and consistent way.
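
That point is easy to see in code: reusing the letter_probs() sketch above, each row of the table is estimated from its own sample, and adding a language touches nothing else (the training strings below are just stand-ins for real 20,000-character samples):

```python
models = {
    "English": letter_probs("a top aide to a republican congressman from arizona ..."),
    "Italian": letter_probs("la presidente della camera in tv commenta gli incidenti ..."),
}
# Adding a new row needs only that language's own counts.
models["Somali"] = letter_probs("stand-in somali training sample")
```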

OK, let's go on with our example. The first letter of the first text to be classified is 'a'. Our probability estimates, and the resulting log likelihood ratio, are $$P(a|English) = 0.08675\\P(a|Italian) = 0.1175\\log(\frac{P(a|English)}{P(a|Italian)}) = -0.30341$$
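
(These are natural logarithms: in Python, math.log(0.08675 / 0.1175) comes out to about -0.3034.)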

So far, Italian is winning, although in fact this text is the English-language NYT story!

However, if we persevere, the cumulative sum of log likelihood ratios gradually favors English more and more strongly, and after 1,000 letters (less than 200 words), it reaches a value of 91.028, which corresponds to odds of about \(10^{39}\) to 1 in favor of English.
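
(The conversion from a natural-log score to odds: \(e^{91.028} = 10^{91.028/\ln 10} \approx 10^{39.5}\).)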

Because we're treating the letters as independent, we'd get the same cumulative sum if we permuted the letters randomly (though the path to the sum would be a bit different...).
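
A sketch of that cumulative scoring, using per-letter probability dictionaries like the ones from letter_probs() above (a positive running total favors English, a negative one Italian, under equal priors):

```python
import math

def cumulative_llr(text, p_english, p_italian):
    """Running sum of per-letter log likelihood ratios
    log(P(letter|English) / P(letter|Italian)) over a text."""
    scores, total = [], 0.0
    for ch in text.lower():
        if 'a' <= ch <= 'z':
            total += math.log(p_english[ch]) - math.log(p_italian[ch])
            scores.append(total)
    return scores
```

Because addition is commutative, the final score depends only on the letter counts, not on their order -- which is just the permutation-invariance described above.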

The analogous plot for the first thousand characters of the La Stampa story tells the same story in reverse, with the cumulative log likelihood ratio increasingly favoring Italian.

So despite the frankly moronic nature of our model for \(P(letter|language)\), things are working out pretty well!

And that's why techniques of this general kind are widely used -- they're simple and easy, and they often work a lot better than you might think they deserve to.

There's some evidence that evolution learned this lesson long before statisticians and engineers did: See e.g. Paul Cisek, "Neurobiology: The currency of guessing", Nature 447:1061-1062, 2007; or Joshua Gold and Michael Shadlen, "Banburismus and the Brain", Neuron 36:299-308, 2002.

[For a slightly more technical introduction to "Naive Bayes" classifiers, see these lecture notes by Eamonn Keogh.]