## Case study #2: Classifying news stories

Ideally we'd like to make decisions about the topic of a text by understanding its content. But you can tell a lot about a text just from the words that are used in it, regardless of how they're put together. Methods that rely only on the relative frequency of words in a text are commonly known as "bag of words" techniques.

We're going to look at (some subsets of) the New York Times Annotated Corpus, which contains over 1.8 million articles that appeared in the New York Times between 1987 and 2007. The manual for the collection is here.

We'll start with a simple problem: Is a particular story from the sports section or from the business section? We have plenty of training data, since each story in the NYT corpus is labelled with the "desk" originally responsible for its production.

There are 110,407,949 words in 20 years of Sports stories, and 73,570,588 words in 20 years of Business/Financial stories. So let's use the details of these two "bags of words" to try to classify a few stories from today's paper -- even though our training data is between 7 and 27 years out of date. And we might as well start with the simple "Naive Bayes" classification method that we learned about earlier -- except that we'll use a slightly less silly set of features, namely words (in the sense of distinct letter-strings) rather than letters of the alphabet.

We'll start with Steve Eder, "Rodriguez Drops Suits Against Baseball and Union", 2/7/2014. The story's first sentence:

After a year of warring with Major League Baseball, Alex Rodriguez effectively ended his battle on Friday, dropping his lawsuits against baseball and the players union over his doping suspension.

Here are the counts, normalized frequencies, and log-likelihood-ratios relative to our two training sets (where a positive LLR favors Sports, and a negative one favors Business):

| Word | Sports # | Per MW | Business # | Per MW | LLR |
|---|---|---|---|---|---|
| after | 294422 | 2666.7 | 116736 | 1586.7 | 0.5 |
| a | 2954638 | 26761.1 | 1829577 | 24868.3 | 0.1 |
| year | 308992 | 2798.6 | 225221 | 3061.3 | -0.1 |
| of | 2161529 | 19577.7 | 2029112 | 27580.5 | -0.3 |
| warring | 73 | 0.7 | 71 | 1.0 | -0.4 |
| with | 894605 | 8102.7 | 445101 | 6050.0 | 0.3 |
| major | 57052 | 516.7 | 29929 | 406.8 | 0.2 |
| league | 202342 | 1832.7 | 2903 | 39.5 | 3.8 |
| baseball | 79289 | 718.1 | 2449 | 33.3 | 3.1 |
| alex | 7217 | 65.4 | 1668 | 22.7 | 1.1 |
| rodriguez | 11363 | 102.9 | 338 | 4.6 | 3.1 |
| effectively | 2249 | 20.4 | 3527 | 47.9 | -0.9 |
| ended | 25982 | 235.3 | 12975 | 176.4 | 0.3 |
| his | 887622 | 8039.5 | 179274 | 2436.8 | 1.2 |
| battle | 7898 | 71.5 | 6786 | 92.2 | -0.3 |
| on | 869601 | 7876.3 | 528329 | 7181.3 | 0.1 |
| friday | 37647 | 341.0 | 18878 | 256.6 | 0.3 |
| dropping | 2697 | 24.4 | 1714 | 23.3 | 0.0 |
| his | 887622 | 8039.5 | 179274 | 2436.8 | 1.2 |
| lawsuits | 493 | 4.5 | 5496 | 74.7 | -2.8 |
| against | 164029 | 1485.7 | 38932 | 529.2 | 1.0 |
| baseball | 79289 | 718.1 | 2449 | 33.3 | 3.1 |
| and | 2563419 | 23217.7 | 1573488 | 21387.5 | 0.1 |
| the | 7196756 | 65183.3 | 4320256 | 58722.6 | 0.1 |
| players | 181399 | 1643.0 | 6192 | 84.2 | 3.0 |
| union | 19730 | 178.7 | 21951 | 298.4 | -0.5 |
| over | 181554 | 1644.4 | 94381 | 1282.9 | 0.2 |
| his | 887622 | 8039.5 | 179274 | 2436.8 | 1.2 |
| doping | 2873 | 26.0 | 13 | 0.2 | 5.0 |
| suspension | 8211 | 74.4 | 531 | 7.2 | 2.3 |
| **TOTAL** | | | | | **26.1** |
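The LLR column appears to be the natural log of the ratio of the two per-million-word frequencies (e.g. ln(1832.7/39.5) ≈ 3.8 for "league"). A minimal sketch of that computation, using the corpus totals quoted above -- the function names here are just for illustration:

```python
import math

# Corpus sizes quoted in the text: total words in 20 years of each section
SPORTS_TOTAL = 110_407_949
BUSINESS_TOTAL = 73_570_588

def per_million(count, total):
    """Normalized frequency: occurrences per million words."""
    return 1e6 * count / total

def llr(sports_count, business_count):
    """Natural-log likelihood ratio of per-million frequencies.
    Positive favors Sports, negative favors Business."""
    f_sports = per_million(sports_count, SPORTS_TOTAL)
    f_business = per_million(business_count, BUSINESS_TOTAL)
    return math.log(f_sports / f_business)

# Reproduce a few rows of the table above:
print(round(llr(202342, 2903), 1))   # "league"   -> 3.8
print(round(llr(493, 5496), 1))      # "lawsuits" -> -2.8
print(round(llr(2873, 13), 1))       # "doping"   -> 5.0
```

Summing these per-word scores over a sentence (or a whole story) gives the totals reported below.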

Not every word points in the right direction -- and some, like "lawsuits", point strongly in the wrong direction -- but over the course of 30 words, we pick up a pretty strong "Sports" signal. And the whole story gives us an even clearer total sum of 239.5: given that this is the log of a product of probabilities, we can safely place a large bet on the Sports option.
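To see why the bet is safe: under the naive Bayes independence assumption (and equal priors), the summed LLR is the log of the posterior odds of Sports over Business, so exponentiating it recovers the odds themselves. A quick check on the two totals just mentioned:

```python
import math

# Totals from the text: 26.1 for the first sentence, 239.5 for the whole story
sentence_llr = 26.1
story_llr = 239.5

# Odds of Sports : Business implied by the first sentence alone
sentence_odds = math.exp(sentence_llr)          # roughly 2e11 to 1

# The whole-story odds overflow a float comfortably expressed directly,
# so report them as a power of ten instead
story_log10_odds = story_llr / math.log(10)     # about 104

print(f"sentence odds: about {sentence_odds:.1e} to 1")
print(f"story odds: about 10^{story_log10_odds:.0f} to 1")
```

Even one sentence yields odds of hundreds of billions to one, and the full story pushes that to roughly 10^104.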

In contrast, let's take a look at a random story from today's Business section, "For Many Older Americans, an Entrepreneurial Path", 2/7/2014. The first sentence doesn't seem especially business-y (or especially sports-y either):

When Marilyn Arnold was 9 years old, her mother, a skilled seamstress, patiently taught her to sew on a vintage Singer treadle sewing machine.

And the bag-of-words judgment is likewise pretty even, though Business does come out ahead by a nose:

| Word | Sports # | Per MW | Business # | Per MW | LLR |
|---|---|---|---|---|---|
| when | 347591 | 3148.2 | 124983 | 1698.8 | 0.6 |
| marilyn | 191 | 1.7 | 359 | 4.9 | -1.0 |
| arnold | 1711 | 15.5 | 2239 | 30.4 | -0.7 |
| was | 1025355 | 9287.0 | 387642 | 5269.0 | 0.6 |
| years | 137940 | 1249.4 | 109430 | 1487.4 | -0.2 |
| old | 96591 | 874.9 | 25631 | 348.4 | 0.9 |
| her | 118565 | 1073.9 | 45881 | 623.6 | 0.5 |
| mother | 11044 | 100.0 | 3644 | 49.5 | 0.7 |
| a | 2954638 | 26761.1 | 1829577 | 24868.3 | 0.1 |
| skilled | 971 | 8.8 | 1030 | 14.0 | -0.5 |
| seamstress | 23 | 0.2 | 29 | 0.4 | -0.6 |
| patiently | 606 | 5.5 | 90 | 1.2 | 1.5 |
| taught | 2640 | 23.9 | 1213 | 16.5 | 0.4 |
| her | 118565 | 1073.9 | 45881 | 623.6 | 0.5 |
| to | 2696849 | 24426.2 | 1997781 | 27154.6 | -0.1 |
| sew | 59 | 0.5 | 53 | 0.7 | -0.3 |
| on | 869601 | 7876.3 | 528329 | 7181.3 | 0.1 |
| a | 2954638 | 26761.1 | 1829577 | 24868.3 | 0.1 |
| vintage | 702 | 6.4 | 509 | 6.9 | -0.1 |
| singer | 701 | 6.3 | 1198 | 16.3 | -0.9 |
| treadle | 0 | 0.0 | 3 | 0.0 | 0.0 |
| sewing | 54 | 0.5 | 217 | 2.9 | -1.8 |
| machine | 2300 | 20.8 | 4344 | 59.0 | -1.0 |
| **TOTAL** | | | | | **-1.252** |
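Note the row for "treadle", which never occurs in the Sports training data: a raw frequency ratio would give it a log of zero, which is undefined, and the table simply assigns it an LLR of 0.0 (i.e. the word is ignored). A standard alternative, not necessarily the one used here, is add-alpha (Laplace) smoothing, which gives zero-count words a small pseudo-count so they still contribute a finite score. A sketch under that assumption:

```python
import math

# Corpus sizes quoted earlier in the text
SPORTS_TOTAL = 110_407_949
BUSINESS_TOTAL = 73_570_588

def smoothed_llr(sports_count, business_count, alpha=1.0):
    """Add-alpha smoothed log-likelihood ratio, so a word with a zero
    count in one corpus still gets a finite (if muted) score."""
    f_sports = (sports_count + alpha) / (SPORTS_TOTAL + alpha)
    f_business = (business_count + alpha) / (BUSINESS_TOTAL + alpha)
    return math.log(f_sports / f_business)

# "treadle": 0 occurrences in Sports, 3 in Business.
# With smoothing it gets a finite score that mildly favors Business.
print(round(smoothed_llr(0, 3), 2))
```

With huge training corpora like these, smoothing barely changes the scores of frequent words, but it matters for rare ones.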

But then we move past the story's feature-y beginning, and start encountering words like

| Word | Sports # | Per MW | Business # | Per MW | LLR |
|---|---|---|---|---|---|
| business | 17430 | 157.9 | 130878 | 1778.9 | -2.4 |
| venture | 1046 | 9.4 | 14422 | 196.0 | -3.0 |
| accountant | 282 | 2.6 | 1448 | 19.7 | -2.0 |
| economy | 1019 | 9.2 | 41642 | 566.0 | -4.1 |
| recession | 224 | 2.0 | 6419 | 87.2 | -3.8 |
| entrepreneurial | 73 | 0.7 | 1119 | 15.2 | -3.1 |

And over the course of the whole story, we end up with a score of -549.3, representing astronomical odds in favor of the right answer.

Neither of these stories was especially central or typical of its topic area, and our training data was many years removed from the test set, which is a serious problem for a genre as time-sensitive as news. But the method worked well all the same. When we have well-defined distinctions and a large body of training material, simple techniques of this kind are usually effective.

There are a lot of things that a "bag of words" doesn't tell us. The lexical histogram for a recipe gives us some clues about the ingredients list, but tells us very little about how to put them together. The lexical histogram for a set of driving directions tells us how many lefts and rights we need to take, but not in what order or onto what streets or highways. But if all we want to know is whether the document is a recipe or a set of driving directions, the word list is pretty good evidence.

And if we had little or no training data -- or if our training data were even less appropriate, say by being in a different language -- we'd have a different and more interesting set of problems to deal with. We'll look at cases of this kind later in the semester.