Linguists involved in human experimentation or computer simulation are likely familiar with the use of accuracy statistics to summarize the results of a classification task (either by humans or computers). While accuracy (that is, the number of correct responses divided by the total number of responses) is a useful statistic, both psychologists and engineers studying information retrieval have developed other simple, intuitive statistics for summarizing classification.
Consider the four types of outcomes observed in a classification context (Neyman and Pearson 1928):
| true positive (tp) | false positive (fp) |
| false negative (fn) | true negative (tn) |
It is worthwhile to spend some time understanding this table, which is sometimes called a confusion matrix. In each cell, the first word refers to whether the response was correct (true) or incorrect (false). The second word of each cell refers to the response itself; what counts as a positive or a negative response depends upon the specification of the task.
Let us imagine a study in which subjects are asked whether or not two words are homonyms; we will say that the "different" response is a positive, whereas "same" is negative. It is common to include fillers in this design to assess whether speakers understand the task; for instance, Johnson (2007: chapter 3) asks subjects whether pause and paws, which are homonyms in every North American dialect of English, are pronounced the same or different. Intuitively, a subject who falsely responds "different" to homonyms shows a bias toward classifying stimuli as "different"; this outcome is a false positive (or Type I error). We can also imagine a different scenario, in which we ask speakers whether beat and bet, which are pronounced differently in all North American dialects of English, are the same or different. A subject who classifies this stimulus as "same" has produced a false negative (or Type II error): false since the response is a priori incorrect, and negative simply since we have defined the "different" response as positive. In other terminology you may be familiar with, a correct "different" response (a true positive) is called a hit, and a false positive is called a false alarm.
When trying to assess the discrimination ability of a subject (or a model), we count the number of responses that fall into each of these cells. The simplest measure, one with which most are familiar, is accuracy, the proportion of responses that match the true outcomes:

$$\text{accuracy} = \frac{tp + tn}{tp + fp + fn + tn}$$
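As a minimal sketch in Python, accuracy can be computed from the four cell counts as the share of correct responses among all responses. The function name and the counts below are hypothetical, chosen only for illustration.

```python
def accuracy(tp: int, fp: int, fn: int, tn: int) -> float:
    """Proportion of all responses that were correct."""
    total = tp + fp + fn + tn
    if total == 0:
        raise ValueError("no responses to score")
    return (tp + tn) / total


# Hypothetical counts for one subject: 40 true positives, 5 false positives,
# 10 false negatives, and 45 true negatives.
print(accuracy(tp=40, fp=5, fn=10, tn=45))  # 0.85
```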
However, in a highly biased experiment, where one response is much more frequent than the other, it is trivially easy to get a relatively high accuracy score by simply guessing the most frequent classification on every trial. For instance, if 90% of the stimuli are actually "different", a subject (or computer model) that guesses "different" every time will have 90% accuracy! For this reason, the statistics called precision and recall (the latter also known as hit rate) are used to tease apart the subject's bias:

$$\text{precision} = \frac{tp}{tp + fp} \qquad \text{recall} = \frac{tp}{tp + fn}$$
These statistics, which range between 0 and 1, have multiple interpretations, so let us first focus on how one could trivially obtain a high precision or recall. If 60% of the answers are "different", the subject who guesses "different" every time turns every "different" stimulus into a true positive and every "same" stimulus into a false positive, and so achieves a precision of 0.6 and perfect recall. Precision could be improved by replacing the 40% false positives with true negatives. Recall would go down if any false negatives occurred.
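The always-"different" guesser described above can be checked with a short Python sketch. The function names are hypothetical, and the counts assume 100 trials of which 60 are truly "different".

```python
def precision(tp: int, fp: int) -> float:
    """Share of positive responses that were correct."""
    if tp + fp == 0:
        raise ValueError("no positive responses were given")
    return tp / (tp + fp)


def recall(tp: int, fn: int) -> float:
    """Share of truly positive stimuli that received a positive response."""
    if tp + fn == 0:
        raise ValueError("no positive stimuli were presented")
    return tp / (tp + fn)


# Assumed scenario: 100 trials, 60 truly "different", subject always
# answers "different". There are no negative responses at all.
tp, fp, fn, tn = 60, 40, 0, 0
print(precision(tp, fp))  # 0.6
print(recall(tp, fn))     # 1.0
```

Turning the 40 false positives into true negatives (`fp=0`) raises precision to 1.0, while any nonzero `fn` pulls recall below 1.0.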
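Precision and recall can also be interpreted as conditional probabilities:

$$\text{precision} = P(\text{stimulus is positive} \mid \text{response is positive})$$

$$\text{recall} = P(\text{response is positive} \mid \text{stimulus is positive})$$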
Instead of using accuracy as a measure of a subject's actual discrimination, it is common to use the harmonic mean of precision and recall (that is, 2 × precision × recall / (precision + recall)), a statistic called the F-measure or the F1 score. Like the more familiar "arithmetic" mean, the harmonic mean of two distinct values is greater than the smaller of the two values and less than the larger of the two values, but is always closer to the smaller of the two (and therefore also falls between 0 and 1). If 60% of the answers are "different", the subject who guesses "different" every time achieves an F1 of 0.75. This corresponds to a reasonable baseline for discrimination in this task.
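A small Python sketch of the F1 computation, using the precision (0.6) and recall (1.0) of the always-"different" guesser; the function name is hypothetical. The result is rounded for display because floating-point arithmetic may not land exactly on 0.75.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0  # convention assumed here: F1 is 0 when both are 0
    return 2 * precision * recall / (precision + recall)


p, r = 0.6, 1.0
print(round(f1_score(p, r), 2))  # 0.75
print(round((p + r) / 2, 2))     # 0.8, the arithmetic mean, for comparison
```

The harmonic mean (0.75) sits closer to the smaller value (0.6) than the arithmetic mean (0.8) does.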
The final piece of the puzzle is provided by the ability to directly assess subjects' discrimination and bias on a narrow scale. Grier (1970) and, later, Donaldson (1992) developed simple statistics to measure discrimination and bias using only the counts in each cell of the Neyman-Pearson square above. First, we compute recall (here written HR, short for hit rate), as well as a new statistic known as false alarm rate, written FAR:

$$HR = \frac{tp}{tp + fn} \qquad FAR = \frac{fp}{fp + tn}$$
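These two values are used to compute A', a statistic measuring the degree of discrimination (given here in Grier's computing form):

$$A' = \begin{cases} \dfrac{1}{2} + \dfrac{(HR - FAR)(1 + HR - FAR)}{4\,HR\,(1 - FAR)} & \text{if } HR \geq FAR \\[2ex] \dfrac{1}{2} - \dfrac{(FAR - HR)(1 + FAR - HR)}{4\,FAR\,(1 - HR)} & \text{if } HR < FAR \end{cases}$$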
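This value ranges between 0 and 1, where 1 indicates perfect classification and 0.5 indicates chance performance. Finally, using the same measures, we can compute the B' measure (given here in Donaldson's form):

$$B' = \frac{(1 - HR)(1 - FAR) - HR \cdot FAR}{(1 - HR)(1 - FAR) + HR \cdot FAR}$$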
This statistic ranges between -1 and 1, where 0 corresponds to "no bias". A speaker who said "different" every time has a complete bias and a B' of -1, whereas a speaker who said "same" every time would have a B' of 1.
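The following Python sketch computes both statistics from hit rate and false alarm rate. It assumes Grier's computing formula for A' and Donaldson's bias index for B'; these choices reproduce the ranges and endpoints described above, but the function names and the handling of undefined cases are assumptions of this sketch, not necessarily what the R implementations mentioned below do.

```python
import math


def a_prime(hr: float, far: float) -> float:
    """Discrimination index A' (assumed: Grier's computing formula)."""
    if hr == far:
        return 0.5  # chance-level discrimination; also avoids 0/0
    if hr > far:
        return 0.5 + (hr - far) * (1 + hr - far) / (4 * hr * (1 - far))
    return 0.5 - (far - hr) * (1 + far - hr) / (4 * far * (1 - hr))


def b_prime(hr: float, far: float) -> float:
    """Bias index B' (assumed: Donaldson's formulation)."""
    negatives = (1 - hr) * (1 - far)
    positives = hr * far
    if negatives + positives == 0:
        return math.nan  # undefined, e.g. when HR = 1 and FAR = 0
    return (negatives - positives) / (negatives + positives)


# A subject who answers "different" on every trial: HR = 1 and FAR = 1.
print(a_prime(1.0, 1.0))  # 0.5
print(b_prime(1.0, 1.0))  # -1.0
```

A subject who always answers "same" has HR = 0 and FAR = 0, which gives the same A' of 0.5 but a B' of 1.0.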
Some readers may be familiar with similar d' and beta statistics for measuring discrimination and bias (e.g., see this tutorial by Pat Keating at UCLA, or Macmillan and Creelman 2005). Neither d'/beta analysis (and the associated detection theory) nor A'/B' analysis (and the associated choice theory) is assumption-free. However, there are clearly more assumptions associated with d'/beta analysis (to wit: under "detection" theory, both hit rate and false alarm rate are assumed to be normally distributed with variance equal to 1) than with the A'/B' analysis techniques, and as Donaldson (1993) shows, the A'/B' methods are more robust to violations of assumptions than is d'/beta analysis. For this reason, I consider d'/beta analysis to be deprecated in favor of A'/B' analysis; at the very least, it should be used only when these assumptions are backed up by tests for normality and by comparing the estimated variance for hit rate and false alarm rate to the assumed 1.
For a real-world example of A'/B' analysis, see my paper (with Catherine Lai and others) Perception of disfluency: Language differences and listener bias. This file has my implementations of A' and B' (and d'; I'd advise just reporting them all!) functions in R.