## Case study #2: Classifying news stories

Ideally we'd like to make decisions about the topic of a text by understanding its content. But you can tell a lot about a text just from the words that are used in it, regardless of how they're put together. Methods that rely only on the relative frequency of words in a text are commonly known as "bag of words" techniques.

We're going to look at (some subsets of) the New York Times Annotated Corpus, which contains over 1.8 million articles that appeared in the New York Times between 1987 and 2007. The manual for the collection is here.

We'll start with a simple problem: Is a particular story from the sports section or from the business section? We have plenty of training data, since each story in the NYT corpus is labelled with the "desk" originally responsible for its production.

There are 110,407,949 words in 20 years of Sports stories, and 73,570,588 words in 20 years of Business/Financial stories. So let's use the details of these two "bags of words" to try to classify a few stories from today's paper -- even though our training data is between 7 and 27 years out of date. And we might as well start with the simple "Naive Bayes" classification method that we learned about earlier -- except that we'll use a slightly less silly set of features, namely words (in the sense of distinct letter-strings) rather than letters of the alphabet.

We'll start with Steve Eder, "Rodriguez Drops Suits Against Baseball and Union", 2/7/2014. The story's first sentence:

After a year of warring with Major League Baseball, Alex Rodriguez effectively ended his battle on Friday, dropping his lawsuits against baseball and the players union over his doping suspension.

Here are the counts, normalized frequencies, and log-likelihood-ratios relative to our two training sets (where a positive LLR favors Sports, and a negative one favors Business):

| Word | Sports # | Per MW | Business # | Per MW | LLR |
|---|---|---|---|---|---|
| after | 294422 | 2666.7 | 116736 | 1586.7 | 0.5 |
| a | 2954638 | 26761.1 | 1829577 | 24868.3 | 0.1 |
| year | 308992 | 2798.6 | 225221 | 3061.3 | -0.1 |
| of | 2161529 | 19577.7 | 2029112 | 27580.5 | -0.3 |
| warring | 73 | 0.7 | 71 | 1.0 | -0.4 |
| with | 894605 | 8102.7 | 445101 | 6050.0 | 0.3 |
| major | 57052 | 516.7 | 29929 | 406.8 | 0.2 |
| league | 202342 | 1832.7 | 2903 | 39.5 | 3.8 |
| baseball | 79289 | 718.1 | 2449 | 33.3 | 3.1 |
| alex | 7217 | 65.4 | 1668 | 22.7 | 1.1 |
| rodriguez | 11363 | 102.9 | 338 | 4.6 | 3.1 |
| effectively | 2249 | 20.4 | 3527 | 47.9 | -0.9 |
| ended | 25982 | 235.3 | 12975 | 176.4 | 0.3 |
| his | 887622 | 8039.5 | 179274 | 2436.8 | 1.2 |
| battle | 7898 | 71.5 | 6786 | 92.2 | -0.3 |
| on | 869601 | 7876.3 | 528329 | 7181.3 | 0.1 |
| friday | 37647 | 341.0 | 18878 | 256.6 | 0.3 |
| dropping | 2697 | 24.4 | 1714 | 23.3 | 0.0 |
| his | 887622 | 8039.5 | 179274 | 2436.8 | 1.2 |
| lawsuits | 493 | 4.5 | 5496 | 74.7 | -2.8 |
| against | 164029 | 1485.7 | 38932 | 529.2 | 1.0 |
| baseball | 79289 | 718.1 | 2449 | 33.3 | 3.1 |
| and | 2563419 | 23217.7 | 1573488 | 21387.5 | 0.1 |
| the | 7196756 | 65183.3 | 4320256 | 58722.6 | 0.1 |
| players | 181399 | 1643.0 | 6192 | 84.2 | 3.0 |
| union | 19730 | 178.7 | 21951 | 298.4 | -0.5 |
| over | 181554 | 1644.4 | 94381 | 1282.9 | 0.2 |
| his | 887622 | 8039.5 | 179274 | 2436.8 | 1.2 |
| doping | 2873 | 26.0 | 13 | 0.2 | 5.0 |
| suspension | 8211 | 74.4 | 531 | 7.2 | 2.3 |
| **TOTAL** | | | | | **26.1** |
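The LLR column appears to be the natural log of the ratio of the two per-million-word frequencies (e.g. ln(1832.7/39.5) ≈ 3.8 for "league"). A minimal sketch of that computation, using the corpus totals quoted above -- the function names here are just for illustration:

```python
import math

# Corpus sizes quoted in the text: total words in 20 years of each section
SPORTS_TOTAL = 110_407_949
BUSINESS_TOTAL = 73_570_588

def per_million(count, total):
    """Normalized frequency: occurrences per million words."""
    return 1e6 * count / total

def llr(sports_count, business_count):
    """Natural-log likelihood ratio of per-million frequencies.
    Positive favors Sports, negative favors Business."""
    f_sports = per_million(sports_count, SPORTS_TOTAL)
    f_business = per_million(business_count, BUSINESS_TOTAL)
    return math.log(f_sports / f_business)

# Reproduce a few rows of the table above:
print(round(llr(202342, 2903), 1))   # "league"   -> 3.8
print(round(llr(493, 5496), 1))      # "lawsuits" -> -2.8
print(round(llr(2873, 13), 1))       # "doping"   -> 5.0
```

Summing these per-word scores over a sentence (or a whole story) gives the totals reported below.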

Not every word points in the right direction -- and some, like "lawsuits", point strongly in the wrong direction -- but over the course of 30 words, we pick up a pretty strong "Sports" signal. And the whole story gives us an even clearer total sum of 239.5: given that this is the log of a product of probabilities, we can safely place a large bet on the Sports option.
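To see why the bet is safe: under the naive Bayes independence assumption (and equal priors), the summed LLR is the log of the posterior odds of Sports over Business, so exponentiating it recovers the odds themselves. A quick check on the two totals just mentioned:

```python
import math

# Totals from the text: 26.1 for the first sentence, 239.5 for the whole story
sentence_llr = 26.1
story_llr = 239.5

# Odds of Sports : Business implied by the first sentence alone
sentence_odds = math.exp(sentence_llr)          # roughly 2e11 to 1

# The whole-story odds overflow a float comfortably expressed directly,
# so report them as a power of ten instead
story_log10_odds = story_llr / math.log(10)     # about 104

print(f"sentence odds: about {sentence_odds:.1e} to 1")
print(f"story odds: about 10^{story_log10_odds:.0f} to 1")
```

Even one sentence yields odds of hundreds of billions to one, and the full story pushes that to roughly 10^104.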

In contrast, let's take a look at a random story from today's Business section, "For Many Older Americans, an Entrepreneurial Path", 2/7/2014. The first sentence doesn't seem especially business-y (or especially sports-y either):

When Marilyn Arnold was 9 years old, her mother, a skilled seamstress, patiently taught her to sew on a vintage Singer treadle sewing machine.

And the bag-of-words judgment is likewise pretty even, though Business does come out ahead by a nose:

| Word | Sports # | Per MW | Business # | Per MW | LLR |
|---|---|---|---|---|---|
| when | 347591 | 3148.2 | 124983 | 1698.8 | 0.6 |
| marilyn | 191 | 1.7 | 359 | 4.9 | -1.0 |
| arnold | 1711 | 15.5 | 2239 | 30.4 | -0.7 |
| was | 1025355 | 9287.0 | 387642 | 5269.0 | 0.6 |
| years | 137940 | 1249.4 | 109430 | 1487.4 | -0.2 |
| old | 96591 | 874.9 | 25631 | 348.4 | 0.9 |
| her | 118565 | 1073.9 | 45881 | 623.6 | 0.5 |
| mother | 11044 | 100.0 | 3644 | 49.5 | 0.7 |
| a | 2954638 | 26761.1 | 1829577 | 24868.3 | 0.1 |
| skilled | 971 | 8.8 | 1030 | 14.0 | -0.5 |
| seamstress | 23 | 0.2 | 29 | 0.4 | -0.6 |
| patiently | 606 | 5.5 | 90 | 1.2 | 1.5 |
| taught | 2640 | 23.9 | 1213 | 16.5 | 0.4 |
| her | 118565 | 1073.9 | 45881 | 623.6 | 0.5 |
| to | 2696849 | 24426.2 | 1997781 | 27154.6 | -0.1 |
| sew | 59 | 0.5 | 53 | 0.7 | -0.3 |
| on | 869601 | 7876.3 | 528329 | 7181.3 | 0.1 |
| a | 2954638 | 26761.1 | 1829577 | 24868.3 | 0.1 |
| vintage | 702 | 6.4 | 509 | 6.9 | -0.1 |
| singer | 701 | 6.3 | 1198 | 16.3 | -0.9 |
| treadle | 0 | 0.0 | 3 | 0.0 | 0.0 |
| sewing | 54 | 0.5 | 217 | 2.9 | -1.8 |
| machine | 2300 | 20.8 | 4344 | 59.0 | -1.0 |
| **TOTAL** | | | | | **-1.252** |
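Note the row for "treadle", which never occurs in the Sports training data: a raw frequency ratio would give it a log of zero, which is undefined, and the table simply assigns it an LLR of 0.0 (i.e. the word is ignored). A standard alternative, not necessarily the one used here, is add-alpha (Laplace) smoothing, which gives zero-count words a small pseudo-count so they still contribute a finite score. A sketch under that assumption:

```python
import math

# Corpus sizes quoted earlier in the text
SPORTS_TOTAL = 110_407_949
BUSINESS_TOTAL = 73_570_588

def smoothed_llr(sports_count, business_count, alpha=1.0):
    """Add-alpha smoothed log-likelihood ratio, so a word with a zero
    count in one corpus still gets a finite (if muted) score."""
    f_sports = (sports_count + alpha) / (SPORTS_TOTAL + alpha)
    f_business = (business_count + alpha) / (BUSINESS_TOTAL + alpha)
    return math.log(f_sports / f_business)

# "treadle": 0 occurrences in Sports, 3 in Business.
# With smoothing it gets a finite score that mildly favors Business.
print(round(smoothed_llr(0, 3), 2))
```

With huge training corpora like these, smoothing barely changes the scores of frequent words, but it matters for rare ones.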

But then we move past the story's feature-y beginning, and start encountering words like

| Word | Sports # | Per MW | Business # | Per MW | LLR |
|---|---|---|---|---|---|
| business | 17430 | 157.9 | 130878 | 1778.9 | -2.4 |
| venture | 1046 | 9.4 | 14422 | 196.0 | -3.0 |
| accountant | 282 | 2.6 | 1448 | 19.7 | -2.0 |
| economy | 1019 | 9.2 | 41642 | 566.0 | -4.1 |
| recession | 224 | 2.0 | 6419 | 87.2 | -3.8 |
| entrepreneurial | 73 | 0.7 | 1119 | 15.2 | -3.1 |

And over the course of the whole story, we end up with a score of -549.3, representing astronomical odds in favor of the right answer.

Neither of these stories was especially central or typical of its topic area, and our training data was many years removed from the test set, which is a serious problem for a genre as time-sensitive as news. But the method worked well all the same. When we have well-defined distinctions and a large body of training material, simple techniques of this kind are usually effective.

There are a lot of things that a "bag of words" doesn't tell us. The lexical histogram for a recipe gives us some clues about the ingredients list, but tells us very little about how to put them together. The lexical histogram for a set of driving directions tells us how many lefts and rights we need to take, but not in what order or onto what streets or highways. But if all we want to know is whether the document is a recipe or a set of driving directions, the word list is pretty good evidence.

And if we had little or no training data -- or if our training data were even less appropriate, say by being in a different language -- we'd have a different and more interesting set of problems to deal with. We'll look at cases of this kind later in the semester.