Joint probability, conditional probability and Bayes' theorem

For those of you who have taken a statistics course, or covered probability in another math course, this should be an easy review.

For the rest of you, we will introduce and define a couple of simple concepts, and a simple (but important!) formula that follows immediately from the definition of the concepts involved. The result is very widely applicable, and the few minutes you spend to become familiar with these ideas may be the most useful few minutes you spend all year!

Sex, Math and English

We'll start out by introducing a simple, concrete example, and defining "joint" and "conditional" probability in terms of that example.

Table 1 shows the number of male and female members of the standing faculty in the departments of Mathematics and English. We learn that the Math department has 1 woman and 37 men, while the English department has 17 women and 20 men. The two departments between them have 75 members, of which 18 are women and 57 are men.

Table 1

          Math   English   Total
Female       1        17      18
Male        37        20      57
Total       38        37      75

Table 2 (below) shows the same information as proportions (of the total of 75 faculty in the two departments). If we wrote the name, sex and department affiliation of each of the 75 individuals on a ping-pong ball, put all 75 balls in a big urn, shook it up, and chose a ball at random, these proportions would represent the probabilities of picking a female Math professor (about .013, or 13 times in a thousand tries), a female English professor (.227), a male Math professor (.493), and so on.

In formula form, we would write P(female, math) = .013, P(female, english) = .227, etc. These are called "joint probabilities"; thus P(female, english) is "the joint probability of female and english". Note that joint probabilities (like logical conjunctions) are symmetrical, so that P(english, female) means the same thing as P(female, english) -- though often we choose a canonical order in which to write down such categories.

Table 2 represents the "joint distribution" of sex and department.

Table 2

          Math   English   Total
Female    .013      .227    .240
Male      .493      .267    .760
Total     .506      .494    1.00

The bottom row and rightmost column in Table 2 give us the proportions in the single categories of sex and department: P(female) = .240, P(male) = .760, P(math) = .506, etc. As before, these proportions can also be seen as the probabilities of picking a ball of the designated category by random selection from our hypothetical urn.
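
If you want to check this kind of arithmetic by machine, here is a minimal Python sketch (just Table 1 re-typed as a dictionary of counts, with variable names invented for this illustration) that builds the joint distribution of Table 2 and reads off the marginal probabilities:

# Counts from Table 1, keyed by (sex, department)
counts = {
    ("female", "math"): 1,  ("female", "english"): 17,
    ("male",   "math"): 37, ("male",   "english"): 20,
}
total = sum(counts.values())                     # 75 faculty in all

# Joint probabilities: the proportion for each (sex, department) pair
joint = {pair: n / total for pair, n in counts.items()}

# Marginal probabilities: sum the joint probabilities over the other variable
p_sex  = {s: sum(p for (sex, _), p in joint.items() if sex == s)
          for s in ("female", "male")}
p_dept = {d: sum(p for (_, dept), p in joint.items() if dept == d)
          for d in ("math", "english")}

print(joint[("female", "math")])   # 0.0133..., shown as .013 in Table 2
print(p_sex["female"])             # 0.24
print(p_dept["math"])              # 0.5066..., shown as .506 in Table 2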

N.B.: we've chosen this example because the relationship between sex and academic discipline is concrete, simple, easy to remember -- and highly non-random -- but not because we think it is appropriate or inevitable. For information about efforts to improve the numbers of women mathematicians, see the web page for the AWM; see this page for an example of a highly successful effort to improve the representation of women in computer science at the undergraduate level.

Now suppose that someone chooses a ball at random from the faculty urn, tells us that the department affiliation is "Math", and invites us to guess the sex. We are then basically dealing with just the first column of Table 1, shown again as Table 3 below (only the Math column now matters):

Table 3

          Math   English   Total
Female       1        17      18
Male        37        20      57
Total       38        37      75

Since 37 out of the 38 Math professors are male, for a proportion of 37/38 or about .974, we could cite very good odds for guessing male: we'd be right about 974 times out of a thousand.

But Table 2 told us that P(male) is about .760. Why is the probability of male .974 now? Obviously, because the assumptions are different. With respect to the total set of 75 faculty in Math and English, the proportion of males is about .760; but with respect just to the 38 faculty in Math, the proportion of males is .974. We symbolize that "with respect to" using a vertical line (usually pronounced "given"), so that we write

P(male | math) = .974

which we read "the probability of male given math is .974". This is a conditional probability. Specifically, it is "the conditional probability of male given math".

Notice also that this is quite different from the "joint probability" P(male, math). And it is also different from the conditional probability P(math | male).

The values for these three quantities are (approximate numbers as always in this discussion):

P(male | math) = .974
P(male, math) = .493
P(math | male) = .649

If this isn't all obvious to you, spend a few minutes copying down the tables and the formulae, and calculating values, until the interpretation of such formulae, at least in concrete cases like this one, is second nature to you.
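
If you prefer to let a computer do the copying, here is a minimal Python sketch (just the Table 1 counts again, with variable names chosen for this example) that computes the three quantities listed above:

# Counts from Table 1
female_math, male_math, male_english = 1, 37, 20
total = 75
math_total = female_math + male_math      # 38 Math professors
male_total = male_math + male_english     # 57 male professors

p_male_given_math = male_math / math_total   # 37/38 = 0.974...
p_male_and_math   = male_math / total        # 37/75 = 0.493...
p_math_given_male = male_math / male_total   # 37/57 = 0.649...

print(p_male_given_math, p_male_and_math, p_math_given_male)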

Now suppose that we don't have access to the original counts (as shown in Tables 1 and 3), but only to the probabilities (i.e. the proportions of balls of different sorts in the hypothetical urn), as shown in Table 2 above, or reproduced in Table 4 below, with associated probability formulae.


Table 4

          Math                       English                       Total
Female    P(female, math) = .013     P(female, english) = .227     P(female) = .240
Male      P(male, math) = .493       P(male, english) = .267       P(male) = .760
Total     P(math) = .506             P(english) = .494             1.00

Could we still calculate P(male | math) -- that is, the probability that a randomly selected faculty member is male, if we know that he or she is in the math department?

Sure. The numbers in Table 4 tell us that 506 times out of a thousand, the chosen faculty member will be from the math department -- and that 493 times out of a thousand, the chosen faculty member will be both male and from the math department.

Therefore, if we know that the prof is from math, the chances of maleness are 493/506, or about .974 -- just what it should be!

In formulaic terms,

P(male | math) = P(male, math) / P(math)      (eq. 1)

Putting it a bit more abstractly, for any values A and B (in a set-up like the one we're talking about):

P(A | B) = P(A, B) / P(B)      (eq. 2)

Plugging in all the other possible values for A and B, relative to our little faculty urn, we can get eight variants on equation 1:

P(male | math)   = P(male, math) / P(math)
P(female | math) = P(female, math) / P(math)
P(male | english) = P(male, english) / P(english)
P(female | english) = P(female, english) / P(english)
P(math | male) = P(math, male) / P(male)
P(math | female) = P(math, female) / P(female)
P(english | male) = P(english, male) / P(male)
P(english | female) = P(english, female) / P(female)

If these relations are not obvious to you, try calculating (at least a few of them) from the probabilities given in Table 2 and the counts given in Table 1.
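
To see that the counts really aren't needed, here is a minimal Python sketch that applies equation (2) directly to the joint probabilities of Table 2 (written as exact fractions of 75 rather than the rounded decimals, and using helper functions invented for this illustration):

# Joint probabilities from Table 2, keyed by (sex, department)
joint = {
    ("female", "math"): 1/75,  ("female", "english"): 17/75,
    ("male",   "math"): 37/75, ("male",   "english"): 20/75,
}

def marginal(value):
    # P(value): sum of all joint probabilities whose pair includes that value
    return sum(p for pair, p in joint.items() if value in pair)

def conditional(a, b):
    # Equation (2): P(A | B) = P(A, B) / P(B)
    p_ab = next(p for pair, p in joint.items() if a in pair and b in pair)
    return p_ab / marginal(b)

print(conditional("male", "math"))       # 0.974..., matching P(male | math)
print(conditional("math", "male"))       # 0.649..., matching P(math | male)
print(conditional("female", "english"))  # 17/37 = 0.459...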

Bayes' Theorem

Now we're ready for Bayes' theorem, which has recently been called (in the pages of the Economist, no less) "the most important equation in the history of mathematics." This might be a little breathless -- but you should definitely know it!

Since equation (2) -- reproduced as (3a) below -- involves arbitrary meta-variables A and B, it's equally true if we swap them, producing equation (3b). And because joint probability is symmetrical, we can re-write equation (3b) as (3c):

P(A | B) = P(A, B) / P(B)      (eq. 3a)
P(B | A) = P(B, A) / P(A)   (eq. 3b)
P(B | A) = P(A, B) / P(A)   (eq. 3c)

Multiplying both sides of equation (3a) by P(B) gives us equation (4):

P(A | B) P(B) = P(A, B)      (eq. 4)

And multiplying both sides of equation (3c) by P(A) gives us equation (5):

P(B | A) P(A) = P(A, B)      (eq. 5)

Since the right-hand sides of equations (4) and (5) are the same, we can equate their left-hand sides, giving us equation (6):

P(A | B) P(B) = P(B | A) P(A)      (eq. 6)

And finally, we can divide both sides of equation (6) by P(B), giving us equation (7):

P(A | B) = P(B | A) P(A) / P(B)      (eq. 7)

This is Bayes' Theorem! Sometimes it is called "Bayes' rule", perhaps because it follows so directly from the definitions involved that it seems hardly to count as a theorem.
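
As a quick sanity check, we can plug the faculty numbers back into equation (7), with A = math and B = male:

P(math | male) = P(male | math) P(math) / P(male) = .974 × .506 / .760 ≈ .649

which agrees with the value we computed directly from the counts (37/57 ≈ .649).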

Why this is a big deal

The usefulness of Bayes' theorem becomes clearer if we forget about simple tables of sex, academic discipline and the like, and think about the relationship between evidence and theory.

Suppose we have a set of alternative theories T1, T2, ..., we've observed some evidence that bears on the choice among them, and we'd like to pick the theory that is most likely to be true given our observations. This leads us to want to define the conditional probability

P(T | E)

i.e. the probability of theory given evidence. Then if we could evaluate this quantity for every possible theory, we would've reduced our problem to the trivial matter of picking the maximum. Of course, the number of possible theories might be inconveniently large, and so we might have to look for a more efficient way to search for the maximum than exhaustive enumeration.

However, there is often a more fundamental problem, which is that we can't find a way to estimate the desired conditional probability, at least not directly. For example, when we are transmitting messages subject to various noise and distortion processes, it can be fairly easy to approximate these processes with generative models, and therefore to estimate how likely an output signal is given a particular choice of message; but such models typically don't allow us a direct estimate of how likely a particular message is given an observed signal. And in general, models for the synthesis of signals tend to be a lot easier to build than models for the analysis of signals. Luckily, we can apply Bayes' theorem to re-define what we want as

P(T | E) = P(E | T) P(T) / P(E)      (eq. 8)

In other words, if we want to know how probable a particular theory Ti is, given some particular evidence E, we can calculate how likely evidence E would be if we assume Ti to hold, multiply by the a priori probability of theory Ti, and divide by how likely we think evidence E is in and of itself. If all we care about is finding the most probable theory (which is normal), we can forget about the normalizing factor P(E), because it will be the same for all alternative theories.

As a result, the "best theory" (the theory with the largest posterior probability given the evidence) will be

ARGMAXi   P(E | Ti) P(Ti)

that is, the value of the subscript i that maximizes the expression
P(E | Ti) P(Ti).
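
Here is a minimal Python sketch of this recipe; the theories, priors and likelihoods are made-up numbers, purely to illustrate the argmax:

# Made-up prior probabilities P(Ti) for three hypothetical theories
priors = {"T1": 0.5, "T2": 0.3, "T3": 0.2}

# Made-up likelihoods P(E | Ti): how probable the observed evidence
# would be if each theory were true
likelihoods = {"T1": 0.01, "T2": 0.20, "T3": 0.15}

# Bayes' rule: P(Ti | E) is proportional to P(E | Ti) * P(Ti).
# The normalizing factor P(E) is the same for every theory,
# so we can ignore it when all we want is the argmax.
scores = {t: likelihoods[t] * priors[t] for t in priors}
best_theory = max(scores, key=scores.get)

print(scores)        # {'T1': 0.005, 'T2': 0.06, 'T3': 0.03}
print(best_theory)   # 'T2'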

This way of thinking about things is very widely used in engineering approaches to pattern recognition.

In particular, equation (8), with theory replaced by sentence and evidence replaced by sound, has been called "the fundamental equation of speech recognition."

The Bayesian framework is also a natural one for models of the computational problems of perception.

As Helmholtz pointed out a century and a half ago, what we perceive is our "best guess" given both sensory data and our prior experience. Bayes' rule shows us how to reconstruct this concept in formal mathematical and computational terms.