LING 001 -- Homework 5

(Due 10/20/2014)

1. Do exercises 2.1 and 2.2 from the end of chapter 2 of the Santorini & Kroch syntax textbook. We recommend that you read chapter 2 first -- it is relatively self-contained.

2. Correcting incorrect parses

The Stanford Parser, like many others of its type, turns word-strings like

Recent findings show that the number of students defaulting on student loans has decreased in recent years.

into parse trees, represented as labeled bracketing, like this:

    (ROOT
      (S
        (NP (JJ Recent) (NNS findings))
        (VP (VBP show)
          (SBAR (IN that)
            (S
              (NP
                (NP (DT the) (NN number))
                (PP (IN of)
                  (NP
                    (NP (NNS students))
                    (VP (VBG defaulting)
                      (PP (IN on)
                        (NP (NN student) (NNS loans)))))))
              (VP (VBZ has)
                (VP (VBN decreased)
                  (PP (IN in)
                    (NP (JJ recent) (NNS years))))))))
        (. .)))

In this representation, each pair of matching left- and right-parentheses, along with the label adjacent to the left-parenthesis, represents a constituent of the designated type:

(X ... )

Thus in the example given above, the string

"that the number of students defaulting on student loans has decreased in recent years"

has been analyzed as an "SBAR" (pronounced "ess bar"), which this handy list of "Nonterminals and Preterminals in the Penn Treebank II" will tell us stands for a "Clause introduced by a (possibly empty) subordinating conjunction".

The word "that" is assigned the category "IN", which the same list identifies as a "Preposition or subordinating conjunction"; the string "the number of students defaulting on student loans" is analyzed as an "NP" (= "Noun Phrase"); and so forth.
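To make the bracket-to-constituent correspondence concrete, here is a minimal Python sketch that reads one labeled bracketing and lists every constituent together with the words it covers. The function names (parse_brackets, words, constituents) are made up for this illustration and are not part of the Stanford Parser or any other tool; the sketch also assumes that every left parenthesis is immediately followed by a label.

```python
import re

def parse_brackets(text):
    """Turn one labeled bracketing into nested (label, children) tuples.

    Words are kept as plain strings; this is an illustrative reader,
    not the Stanford Parser's own code.
    """
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def read():
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError("expected '(' but found " + repr(tokens[pos]))
        label = tokens[pos + 1]
        if label in ("(", ")"):
            raise ValueError("missing label after '('")
        pos += 2
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read())      # a nested constituent
            else:
                children.append(tokens[pos])  # a word
                pos += 1
        pos += 1  # skip the matching ')'
        return (label, children)

    try:
        tree = read()
    except IndexError:
        raise ValueError("unbalanced parentheses") from None
    if pos != len(tokens):
        raise ValueError("extra material after the first constituent")
    return tree

def words(node):
    """Return the list of words that a node covers."""
    if isinstance(node, str):
        return [node]
    covered = []
    for child in node[1]:
        covered.extend(words(child))
    return covered

def constituents(node):
    """Yield (label, word string) for each constituent, outermost first."""
    if isinstance(node, str):
        return
    yield node[0], " ".join(words(node))
    for child in node[1]:
        yield from constituents(child)

tree = parse_brackets("(PP (IN in) (NP (JJ recent) (NNS years)))")
for label, text in constituents(tree):
    print(label, "->", text)
```

Run on the small "in recent years" phrase from the example, this prints one line per pair of matching parentheses, starting with the PP that covers the whole phrase.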

The full set of parsing principles is complicated -- the manual is 318 pages long -- but the basic idea is a simple one: complex phrases are created by putting simpler ones together.
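The idea of putting simpler phrases together can be sketched in a few lines of Python. The helper name phrase is hypothetical, chosen only for this illustration; it simply wraps a label and its parts in one pair of parentheses.

```python
def phrase(label, *parts):
    """Wrap a label and its parts (words or smaller phrases) in brackets."""
    return "(" + label + " " + " ".join(parts) + ")"

# Start from single words, then combine them into larger phrases.
jj = phrase("JJ", "recent")
nns = phrase("NNS", "years")
np = phrase("NP", jj, nns)                 # noun phrase: "recent years"
pp = phrase("PP", phrase("IN", "in"), np)  # prepositional phrase: "in recent years"

print(pp)  # (PP (IN in) (NP (JJ recent) (NNS years)))
```

Each call adds exactly one constituent, so the nesting of the parentheses records the order in which the pieces were combined.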

Such parsers have become quite good -- and the Stanford Parser is probably the best current open-source example -- but they can still make mistakes in remarkably simple cases. For example,

The University of Pennsylvania freshman class.

comes out as

    (ROOT
      (NP
        (NP (DT The) (NNP University))
        (PP (IN of)
          (NP (NNP Pennsylvania) (NN freshman) (NN class)))
        (. .)))

which suggests that "Pennsylvania freshman class" is the name of the university.

By trying inputs like "We joined the Penn freshman class" or "We visited the University of Pennsylvania", which the parser seems to handle correctly, you should be able to figure out what the correct parse for "The University of Pennsylvania freshman class" should look like. Or you could try a competitor, like the online Berkeley Parser demo, which happens to get this one right.

The crux of the matter, in this case, is that "University of Pennsylvania" and "freshman class" should be constituents, so that the main division (leaving out labels and other details) should not be

(University (of (Pennsylvania freshman class)))

but rather

((University of Pennsylvania) (freshman class))
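One way to compare two schematic bracketings is to list the word groups that each one claims are constituents. The following Python sketch does this for unlabeled bracketings; the function name groups is invented for this illustration, and the second input string is written with balanced parentheses.

```python
def groups(bracketing):
    """List the word string enclosed by each pair of parentheses.

    Works on unlabeled schematic bracketings; groups are reported in
    the order in which their right parentheses appear.
    """
    tokens = bracketing.replace("(", " ( ").replace(")", " ) ").split()
    stack = []   # one list of words for each bracket that is still open
    found = []
    for tok in tokens:
        if tok == "(":
            stack.append([])
        elif tok == ")":
            if not stack:
                raise ValueError("unmatched right parenthesis")
            done = stack.pop()
            found.append(" ".join(done))
            if stack:
                stack[-1].extend(done)  # the words also belong to the enclosing group
        else:
            if not stack:
                raise ValueError("word outside all brackets")
            stack[-1].append(tok)
    if stack:
        raise ValueError("unmatched left parenthesis")
    return found

wrong = groups("(University (of (Pennsylvania freshman class)))")
right = groups("((University of Pennsylvania) (freshman class))")

print("Pennsylvania freshman class" in wrong)   # True
print("University of Pennsylvania" in wrong)    # False
print("University of Pennsylvania" in right)    # True
print("freshman class" in right)                # True
```

Only the second bracketing treats "University of Pennsylvania" and "freshman class" as units, which is the difference described above.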

For each of the following three examples from today's news, describe what is wrong with the Stanford Parser's analysis. Then try to give a similarly schematic form of what the right analysis should be.

(1) The Supreme Court let stand appeals court rulings allowing same-sex marriage in five states.

(2) The nurse contracted the illness while treating a Spanish missionary who was infected in Sierra Leone and flown to a hospital in Madrid.

(3) Nick Foles threw two touchdown passes, the defense and special teams each scored and the Eagles held on for a 34-28 victory over the St. Louis Rams on Sunday.

Again, the Berkeley Parser is worth comparing -- though there is no guarantee that either system will get any particular sentence right.

You may find it helpful to start with a more skeletal analysis, putting in only a few of the crucial labeled brackets to make sure that you have the basic structure correct. That is the key step -- adding the rest of the (more detailed) structure can come later.