Joint probability, conditional probability and Bayes' theorem
For those of you who have taken a statistics course, or covered probability in another math course, this should be an easy review.
For the rest of you, we will introduce and define a couple of simple concepts, and a simple (but important!) formula that follows immediately from the definition of the concepts involved. The result is very widely applicable, and the few minutes you spend to become familiar with these ideas may be the most useful few minutes you spend all year!
Sex, Math and English
We'll start out by introducing a simple, concrete example, and defining "joint" and "conditional" probability in terms of that example.
Table 1 shows the number of male and female members of the standing faculty in the departments of Mathematics and English [as of the time this page was originally written; things have improved a bit since then]. We learn that the Math department has 1 woman and 37 men, while the English department has 17 women and 20 men. The two departments between them have 75 members, of which 18 are women and 57 are men.
Table 1: Number of standing faculty by sex and department

          Math   English   Total
Female       1        17      18
Male        37        20      57
Total       38        37      75
Table 2 (below) shows the same information as proportions (of the total of 75 faculty in the two departments). If we wrote the name, sex and department affiliation of each of the 75 individuals on a ping-pong ball, put all 75 balls in a big urn, shook it up, and chose a ball at random, these proportions would represent the probabilities of picking a female Math professor (about .013, or 13 times in a thousand tries), a female English professor (.227), a male Math professor (.493), and so on.
In formula form, we would write P(female, math) = .013, P(female, english) = .227, etc. These are called "joint probabilities"; thus P(female, english) is "the joint probability of female and english". Note that joint probabilities (like logical conjunctions) are symmetrical, so that P(english, female) means the same thing as P(female, english), though we often choose a canonical order in which to write down such categories.
Table 2 represents the "joint distribution" of sex and department.
Table 2: Joint distribution of sex and department (proportions of the 75 faculty)

          Math   English   Total
Female    .013      .227    .240
Male      .493      .267    .760
Total     .506      .494    1.00
The bottom row and rightmost column in Table 2 give us the proportions in the single categories of sex and department: P(female) = .240, P(male) = .760, P(math) = .506, etc. As before, these proportions can also be seen as the probabilities of picking a ball of the designated category by random selection from our hypothetical urn.
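As a minimal sketch of how Table 2 follows from Table 1, the Python below divides each count by the total of 75 and then sums rows and columns to get the single-category proportions. The names `counts`, `joint`, `marginal_sex` and `marginal_dept` are illustrative choices, not part of the original page.

```python
from fractions import Fraction

# Counts from Table 1: (sex, department) -> number of faculty
counts = {
    ("female", "math"): 1,
    ("female", "english"): 17,
    ("male", "math"): 37,
    ("male", "english"): 20,
}
total = sum(counts.values())  # 75

# Joint distribution: each count as an exact proportion of the total
joint = {pair: Fraction(n, total) for pair, n in counts.items()}


def marginal_sex(sex):
    """P(sex): sum the joint probabilities across departments."""
    return sum(p for (s, _), p in joint.items() if s == sex)


def marginal_dept(dept):
    """P(dept): sum the joint probabilities across sexes."""
    return sum(p for (_, d), p in joint.items() if d == dept)


for pair, p in joint.items():
    print(pair, round(float(p), 3))
print("P(female) =", round(float(marginal_sex("female")), 3))
print("P(math)   =", round(float(marginal_dept("math")), 3))
```

Because this works from the exact fractions, P(math) = 38/75 rounds to .507, one unit in the last place away from the .506 in Table 2, which is the sum of the two already-rounded cells .013 and .493.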
N.B.: we've chosen this example because the relationship between sex and academic discipline is concrete, simple, easy to remember, and highly nonrandom, but not because we think it is appropriate or inevitable. For information about efforts to improve the numbers of women mathematicians, see the web page for the AWM (Association for Women in Mathematics); see this page for an example of a highly successful effort to improve the representation of women in computer science at the undergraduate level.
Now suppose that someone chooses a ball at random from the faculty urn, tells us that the department affiliation is "Math", and invites us to guess the sex. We are then basically dealing with just the first column of Table 1, the portion of Table 3 below that is not in parentheses:

Table 3: The counts of Table 1, restricted to the Math department

          Math   English   Total
Female       1      (17)    (18)
Male        37      (20)    (57)
Total       38      (37)    (75)
Since 37 out of the 38 Math professors are male, for a proportion of 37/38 or about .974, we could cite very good odds for guessing male: we'd be right about 974 times out of a thousand.
But Table 2 told us that P(male) is about .760. Why is the probability of male .974 now? Obviously, because the assumptions are different. With respect to the total set of 75 faculty in Math and English, the proportion of males is about .760; but with respect just to the 38 faculty in Math, the proportion of males is .974. We symbolize that "with respect to" using a vertical line (usually pronounced "given"), so that we write
P(male | math) = .974
which we read "the probability of male given math is .974". This is a conditional probability. Specifically, it is "the conditional probability of male given math".
Notice also that this is quite different from the "joint probability" P(male, math). And it is also different from the conditional probability P(math | male).
The values for these three quantities are (approximate numbers as always in this discussion):
P(male | math) = .974
P(male, math) = .493
P(math | male) = .649
If this isn't all obvious to you, spend a few minutes copying down the tables and the formulae, and calculating values, until the interpretation of such formulae, at least in concrete cases like this one, is second nature to you.
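For readers who prefer to check the arithmetic by machine, here is a small Python sketch that computes the three quantities directly from the counts in Table 1. The variable names are illustrative only; the point is that each quantity uses the same numerator (37 male Math professors) with a different denominator.

```python
# Counts from Table 1
male_math, male_english = 37, 20
female_math, female_english = 1, 17

total = male_math + male_english + female_math + female_english  # 75
math_total = male_math + female_math    # 38 faculty in Math
male_total = male_math + male_english   # 57 male faculty

# Same numerator each time; only the reference set changes
p_male_given_math = male_math / math_total   # relative to the Math column
p_male_and_math = male_math / total          # relative to all 75 faculty
p_math_given_male = male_math / male_total   # relative to the Male row

print(f"P(male | math) = {p_male_given_math:.3f}")
print(f"P(male, math)  = {p_male_and_math:.3f}")
print(f"P(math | male) = {p_math_given_male:.3f}")
```

Run as written, this prints .974, .493 and .649 (with a leading zero), matching the values listed above.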
Now suppose that we don't have access to the original counts (as shown
in Tables 1 and 3), but only to the probabilities (i.e. the proportions
of balls of different sorts in the hypothetical urn), as shown in Table
2 above, or reproduced in Table 4 below, with associated probability formulae.
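Table 4: The joint distribution of Table 2, with the associated probability formulae

          Math                      English                      Total
Female    P(female, math) = .013    P(female, english) = .227    P(female) = .240
Male      P(male, math) = .493      P(male, english) = .267      P(male) = .760
Total     P(math) = .506            P(english) = .494            1.00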
Could we still calculate P(male | math), that is, the probability that a randomly selected faculty member is male, if we know that he or she is in the math department?
Sure. The numbers in Table 4 tell us that 506 times out of a thousand, the chosen faculty member will be from the math department, and that 493 times out of a thousand, the chosen faculty member will be both male and from the math department.
Therefore, if we know that the prof is from math, the chances of maleness are 493/506, or about .974: just what it should be!
In formulaic terms,
P(male | math) = P(male, math) / P(math)  (eq. 1)
Putting it a bit more abstractly, for any values A and B (in a setup like the one we're talking about):
P(A | B) = P(A, B) / P(B)  (eq. 2)
Plugging in all the other possible values for A and B, relative to our little faculty urn, we can get eight variants on equation 1:
P(male | math)  =  P(male, math) / P(math)
P(female | math)  =  P(female, math) / P(math)
P(male | english)  =  P(male, english) / P(english)
P(female | english)  =  P(female, english) / P(english)
P(math | male)  =  P(math, male) / P(male)
P(math | female)  =  P(math, female) / P(female)
P(english | male)  =  P(english, male) / P(male)
P(english | female)  =  P(english, female) / P(female)
If these relations are not obvious to you, try calculating (at least a few of them) from the probabilities given in Table 2 and the counts given in Table 1.
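The eight variants can also be worked out mechanically. The sketch below starts from nothing but the four rounded joint probabilities of Table 2 and applies equation 2 in both directions. The dictionary names (`joint`, `p_sex`, `p_dept`, `sex_given_dept`, `dept_given_sex`) are assumptions made for this illustration.

```python
# Joint probabilities as given in Table 2 (rounded to three places)
joint = {
    ("female", "math"): 0.013,
    ("female", "english"): 0.227,
    ("male", "math"): 0.493,
    ("male", "english"): 0.267,
}
sexes = ["female", "male"]
depts = ["math", "english"]

# Marginals: sum the joint probabilities along a row or down a column
p_sex = {s: sum(joint[(s, d)] for d in depts) for s in sexes}
p_dept = {d: sum(joint[(s, d)] for s in sexes) for d in depts}

# Equation 2, P(A | B) = P(A, B) / P(B), applied in both directions
sex_given_dept = {(s, d): joint[(s, d)] / p_dept[d] for s in sexes for d in depts}
dept_given_sex = {(d, s): joint[(s, d)] / p_sex[s] for s in sexes for d in depts}

for (s, d), p in sex_given_dept.items():
    print(f"P({s} | {d}) = {p:.3f}")
for (d, s), p in dept_given_sex.items():
    print(f"P({d} | {s}) = {p:.3f}")
```

Since the inputs are already rounded, the results can differ slightly in the third decimal place from values computed from the counts in Table 1.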
Bayes' Theorem
Now we're ready for Bayes' theorem, which has recently been called (in the pages of the Economist, no less) "the most important equation in the history of mathematics." This might be a little breathless, but you should definitely know it!
Since equation (2), reproduced as (3a) below, involves arbitrary metavariables A and B, it's equally true if we swap them, producing equation (3b). And because joint probability is symmetrical, we can rewrite equation (3b) as (3c):
P(A | B) = P(A, B) / P(B)  (eq. 3a)
P(B | A) = P(B, A) / P(A)  (eq. 3b)
P(B | A) = P(A, B) / P(A)  (eq. 3c)
Multiplying both sides of equation (3a) by P(B) gives us equation (4):
P(A | B) P(B) = P(A, B)  (eq. 4)
And multiplying both sides of equation (3c) by P(A) gives us equation (5):
P(B | A) P(A) = P(A, B)  (eq. 5)
Since the right-hand sides of equations (4) and (5) are the same, we can equate their left-hand sides, giving us equation (6):
P(A | B) P(B) = P(B | A) P(A)  (eq. 6)
And finally, we can divide both sides of equation (6) by P(B), giving us equation (7):
P(A | B) = P(B | A) P(A) / P(B)  (eq. 7)
This is Bayes' Theorem! Sometimes it is called "Bayes' rule", perhaps because it follows so directly from the definitions involved that it seems hardly to count as a theorem.
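To see the theorem at work on the faculty example, the following sketch takes A = math and B = male, computes P(math | male) by way of equation 7, and compares it with the value read directly off the counts in Table 1. The variable names are illustrative.

```python
import math

# Counts from Table 1
total = 75
n_math, n_male, n_male_math = 38, 57, 37

p_math = n_math / total                   # P(math)
p_male = n_male / total                   # P(male)
p_male_given_math = n_male_math / n_math  # P(male | math)

# Bayes' theorem (eq. 7) with A = math and B = male:
# P(math | male) = P(male | math) P(math) / P(male)
p_math_given_male_bayes = p_male_given_math * p_math / p_male

# Direct calculation from the counts, for comparison
p_math_given_male_direct = n_male_math / n_male

print(f"via Bayes' theorem: {p_math_given_male_bayes:.3f}")
print(f"directly:           {p_math_given_male_direct:.3f}")
assert math.isclose(p_math_given_male_bayes, p_math_given_male_direct)
```

Both routes give about .649, the value of P(math | male) quoted earlier.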
Why this is a big deal
The usefulness of Bayes' theorem becomes clearer if we forget about simple tables of sex, academic discipline and the like, and think about the relationship between evidence and theory.
Suppose we have a set of alternative theories T_{1}, T_{2}, ..., and we've observed some evidence that bears on the choice among these theories, and we'd like to pick the theory that is most likely to be true given our observations. This leads us to want to define the conditional probability
P(T | E)
i.e. the probability of theory given evidence. Then if we could evaluate this quantity for every possible theory, we would've reduced our problem to the trivial matter of picking the maximum. Of course, the number of possible theories might be inconveniently large, and so we might have to look for a more efficient way to search for the maximum than exhaustive enumeration.
However, there is often a more fundamental problem, which is that we can't find a way to estimate the desired conditional probability, at least not directly. For example, when we are transmitting messages subject to various noise and distortion processes, it can be fairly easy to approximate these processes with generative models, and therefore to estimate how likely an output signal is given a particular choice of message; but such models typically don't allow us a direct estimate of how likely a particular message is given an observed signal. And in general, models for the synthesis of signals tend to be a lot easier to build than models for the analysis of signals. Luckily, we can apply Bayes' theorem to redefine what we want as
P(T | E) = P(E | T) P(T) / P(E)  (eq. 8)
In other words, if we want to know how probable a particular theory T_{i} is, given some particular evidence E, we can calculate how likely evidence E would be if we assume T_{i} to hold, multiply by the a priori probability of theory T_{i}, and divide by how likely we think evidence E is in and of itself. If all we care about is finding the most probable theory (which is normal), we can forget about the normalizing factor P(E), because it will be the same for all alternative theories.
As a result, the "best theory" (the theory with the largest posterior probability given the evidence) will be
ARGMAX_{i} P(E | T_{i}) P(T_{i})
that is, the theory T_{i} whose subscript i gives the maximum value, over all possible choices of i, of the expression
P(E | T_{i}) P(T_{i}).
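Here is a minimal sketch of that selection in Python. The three theories, their priors and their likelihoods are hypothetical numbers invented for illustration; nothing in them comes from the faculty example. The sketch also assumes the theories are mutually exclusive and exhaustive, so that P(E) can be recovered as the sum of the products P(E | T_{i}) P(T_{i}).

```python
# Hypothetical numbers, for illustration only:
# prior P(T_i) and likelihood P(E | T_i) for three candidate theories
prior = {"T1": 0.7, "T2": 0.2, "T3": 0.1}
likelihood = {"T1": 0.05, "T2": 0.30, "T3": 0.40}

# Unnormalized score P(E | T_i) P(T_i) for each theory
score = {t: likelihood[t] * prior[t] for t in prior}

# The argmax: the theory whose score is largest
best = max(score, key=score.get)

# Assuming the theories are mutually exclusive and exhaustive,
# P(E) is the sum of the scores, and dividing by it gives P(T_i | E)
p_evidence = sum(score.values())
posterior = {t: s / p_evidence for t, s in score.items()}

print("best theory:", best)
for t, p in posterior.items():
    print(f"P({t} | E) = {p:.3f}")
```

With these made-up numbers the winner is T2, even though T1 has the largest prior and T3 the largest likelihood; dividing by P(E) rescales every score by the same factor, so it leaves the ranking unchanged.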
This way of thinking about things is very widely used in engineering approaches to pattern recognition.
In particular, equation (8), with theory replaced by sentence and evidence replaced by sound, has been called "the fundamental equation of speech recognition."
The Bayesian framework is also a natural one for models of the computational problems of perception.
As Helmholtz pointed out a century and a half ago, what we perceive is our "best guess" given both sensory data and our prior experience. Bayes' rule shows us how to reconstruct this concept in formal mathematical and computational terms.