The accuracy of yes/no classification

We start with a problem that involves determining whether each of a large set of instances is -- or is not -- a member of some class. The answer should be more-or-less a matter of objective fact, although determining the fact of the matter may be expensive or may require waiting for an outcome. And the answer should be important to us.

The problem might be a question of medical diagnosis, like "this blood sample comes from someone with breast cancer", which could be used for screening affected populations on a large scale, or for confirming a diagnosis suggested for other reasons.

Or the problem might be a matter of image analysis, like "these two pictures of faces are pictures of the same person", which could be used to index images for search, to tag friends in Facebook posts, or to monitor airports for wanted criminals. It might be a text-analysis question, like "this string of words expresses a negative opinion about a particular person, company, or product", which could be used to estimate popular sentiment from Twitter traffic. Or it might be a deception-detection question: "Is this witness telling the truth?" A large range of real-world problems can be re-formulated as one or more yes/no questions of this type.

We also have a potentially promising method for solving our problem -- promising because it's cheap, or because it's fast, or because it offers a valuable prediction about things that otherwise won't be revealed until some time in the future. The method might be a chemical test, or a computer algorithm, or an expensive device, or a self-proclaimed psychic, or any other system that maps instances to answers.

Whatever the problem and whatever the proposed method, we want to determine how well the method works. What performance metric should we use?

This turns out to be a remarkably subtle and difficult question.

The answer depends very much on what our problem is like, and how we want to use the proposed method, and how we select a sample to test. And it's easy to make choices that are seriously misleading. People who are trying to sell a product may make such choices with the intent to mislead potential customers -- but sometimes we fool ourselves or others without any malicious intent.

For some examples of what NOT to do, read these weblog posts:

"Determining whether a lottery ticket will win, 99.999992849% of the time", 8/29/2004
"Steven D. Levitt: pwned by the base rate fallacy", 4/10/2008
"Linguistic Deception Detection: Part 1", 12/6/2011
"(Mis-) Interpreting medical tests", 3/10/2014
"When 90% is 32%", 3/18/2014

For additional background and terminology, read these lecture notes on "Signal Detection Theory".

Applications in pattern recognition generally use the terminology of "precision" and "recall", while biomedical applications generally use the terminology of "sensitivity" and "specificity". In both cases, the core concept is a 2x2 contingency table that looks like this:

                    Reality is Positive (P)   Reality is Negative (N)
Test is Positive    True Positive (TP)        False Positive (FP)
Test is Negative    False Negative (FN)       True Negative (TN)
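For concreteness, here is a small sketch of how the most common metrics fall out of the four cell counts. The counts themselves are invented for illustration:

```python
# Hypothetical counts for the four cells of the 2x2 table.
TP, FP, FN, TN = 80, 20, 10, 890

precision = TP / (TP + FP)                     # a.k.a. positive predictive value
recall = TP / (TP + FN)                        # a.k.a. sensitivity, true positive rate
specificity = TN / (TN + FP)                   # true negative rate
prevalence = (TP + FN) / (TP + FP + FN + TN)   # base rate of real positives

print(f"precision   = {precision:.3f}")    # 0.800
print(f"recall      = {recall:.3f}")       # 0.889
print(f"specificity = {specificity:.3f}")  # 0.978
print(f"prevalence  = {prevalence:.3f}")   # 0.090
```

Note that "precision" and "positive predictive value" are the same number under two names, as are "recall", "sensitivity", and "true positive rate" -- a first taste of the terminological scaffolding discussed below.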

Around this simple table, researchers have erected a complex and confusing scaffolding of derived terminology (figure from the Wikipedia articles linked above):

Even seasoned researchers can get confused about what particular terms in this elaborate bestiary of metrics actually mean. If you memorize them all, you'll be a source of amazement and admiration to all. More plausibly, you can simply remember that these terms -- precision, recall, sensitivity, specificity, prevalence, positive predictive value, negative predictive value, false discovery rate, false omission rate, diagnostic odds ratio, and so on -- are all defined in terms of relations among cells in the basic 2x2 test-vs.-truth table.

And remember that if the cells of that 2x2 table contain only percentages, a crucial piece of information is lacking, namely the "base rate" or "prevalence". As a result, all the standard metrics are subject to the "base rate fallacy" -- and very sophisticated people, like the economist Steven Levitt, may fall victim to this pernicious source of error.
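To see how the base rate bites, here is a sketch with made-up numbers: a test with 99% sensitivity and 99% specificity sounds excellent, but its positive predictive value collapses when real positives are rare:

```python
# Hypothetical test performance: 99% sensitivity, 99% specificity.
sensitivity, specificity = 0.99, 0.99

def ppv(prevalence):
    """P(really positive | test says positive), via Bayes' rule."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)

for prev in (0.5, 0.1, 0.001):
    print(f"prevalence {prev:>6}: PPV = {ppv(prev):.3f}")
```

With these invented numbers, the PPV is 0.99 when half the population is positive, about 0.92 at 10% prevalence, and only about 0.09 at 0.1% prevalence: more than nine out of ten positive results are false alarms, from a "99% accurate" test.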

One last (but important) thing: No overall evaluation of a test really makes sense unless we have a cost function, that is, a way of calculating the costs (and benefits) of various outcomes. Unfortunately, these costs and benefits are usually rather hard to quantify.
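If we can bring ourselves to put numbers on the outcomes, the calculation itself is trivial. A sketch with entirely invented per-outcome costs (negative values are benefits):

```python
# Hypothetical costs per outcome, in arbitrary units; negative = benefit.
# E.g. a missed case (FN) might be far costlier than a false alarm (FP).
COSTS = {"TP": -50.0, "FP": 10.0, "FN": 200.0, "TN": 0.0}

def expected_cost(counts, costs=COSTS):
    """Average cost per instance, given counts for the four 2x2 cells."""
    total = sum(counts.values())
    return sum(costs[cell] * n for cell, n in counts.items()) / total

print(expected_cost({"TP": 80, "FP": 20, "FN": 10, "TN": 890}))
```

The hard part, of course, is not the arithmetic but defending the numbers in the cost table -- which is exactly why overall "accuracy" so often stands in, misleadingly, for an evaluation that nobody has actually done.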