Comparing documents: TF-IDF

One possible application of document classification is information retrieval -- in that application, we define one class as "the documents we're looking for", and another class as "all other documents". We then to build one model from the query, or perhaps from some documents deemed relevant to the query, and another model from the whole collection. Variants of the "Naive Bayes" method have been quite successful in such applications. Some background reading: David Lewis, "Naive Bayes at forty: The independence assumption in information retrieval", ECML-1998.

But there's another old idea that leads off in an interestingly different direction. For a review, see Stephen Robertson and Karen Spärck Jones, "Simple, proven approaches to text retrieval", UCAM-CL-TR-356 1994.

This approach is based on a metric for the similarity of pairs of "documents", with queries treated simply as unusually short documents. Given one document taken as the basis, all the other documents in the collection can be ranked according to their similarity to this request. (And as we'll see later, pair-wise comparison of documents can also be used in unsupervised clustering methods.)

As with the Naive Bayes approach to document classification that we explored earlier, Robertson and Spärck Jones start with the premise that documents are "bags of words", i.e. multi-sets made up of terms identified as letter strings.

The original metaphor for this approach, suggested by Gerald Salton in the 1960s, was the so-called "vector space model", in which a document is treated as a vector of word counts, and (for example) a "cosine distance" between documents can then be created by starting with the inner product (sum of element-wise products) of such vectors.

(You may remember that the cosine of the angle between two vectors is their inner product multiplied by their (geometric) lengths. In "cosine distance" document-similarity metrics, the length-normalization is generally ignored, though as you'll see, there are various complexities added to try to get the sum of products of word-wise counts to Do The Right Thing...)

It's common to do "stemming" to collapse e.g. guess/guesses/guessing/guessed; and we can introduce multi-word "terms" like district attorney or Ivy League if we want; we can add synonyms or highly-associated terms to strengthen short documents; and so on. But for the purposes of this discussion, we'll just take "terms" to be alphabetic strings from the document.

In the treatment suggested by Robertson and Spärck Jones, each document's term counts are then weighted in the following way. Given

\(n\) is the number of documents term \(t(i)\) occurs in
\(N\) is the total number of documents in the collection

\(IDF(i) = log(N/n)\) is the "Inverse Document Frequency" weight of term \(t(i)\)

[Note that \(IDF(i)\) is 0 for a term \(t(i)\) that occurs in all documents]

\(TF(i,j)\) is the number of times term \(t(i)\) occurs in document \(d(j)\)

\(DL(j)\) is the length (in terms) of document \(d(j)\)

\(NDL(j) = DL(j) / MeanDL\)

     [where \(MeanDL\) is the average document length in the collection]

then the "Combined Weight" for term \(t(i)\) in document \(d(j)\) is

$$CW(i,j) = \frac{TF(i,j)*IDF(i)*(K_1+1)}{K_1*((1-b)+(b*(NDL(j))))+TF(i,j)}$$

This is a complicated version of the simple \(TF(i,j)*IDF(i)\) formula that gives this approach its name, "TF-IDF".

\(K_1\) and \(b\) are tuning parameters: \(K_1\) modulates the effect of term frequency, while \(b\) modulates the effect of document length. The authors suggest that values of \(K_1=2\) and \(b=0.75\) are a good starting guess for a particular application.

We are taking a single document as the "query", and checking its score against all the documents in a collection. Only terms shared between the query and a test document matter, and given that \(QF(i)\) is the frequency of term \(t(i)\) in the query document, the total score for a test document \(d(j)\) will be

$$\sum_i QF(i)*CW(i,j)$$

For some other ways to doctor up the inner product of word counts, see the Wikipedia article on tf-idf. The particular technique described above is known as "Okapi" or "BM25", for obscure historical reasons.