LING 001 -- Homework 5

(Due 10/19/2015)

1. Do exercises 2.1 and 2.2 from the end of chapter 2 of the Santorini & Kroch syntax textbook. You should read chapter 2 first -- it is relatively self-contained.

2. Correcting incorrect parses

The Stanford Parser, like many others of its type, turns word-strings like

Recent findings show that the number of students defaulting on student loans has decreased in recent years.

into parse trees, represented as labeled bracketing, like this:

(ROOT
  (S
    (NP (JJ Recent) (NNS findings))
    (VP (VBP show)
      (SBAR (IN that)
        (S
          (NP
            (NP (DT the) (NN number))
            (PP (IN of)
              (S
                (NP (NNS students))
                (VP (VBG defaulting)
                  (PP (IN on)
                    (NP (NN student) (NNS loans)))))))
          (VP (VBZ has)
            (VP (VBN decreased)
              (PP (IN in)
                (NP (JJ recent) (NNS years))))))))
    (. .)))

In this representation, each pair of matching left- and right-parentheses, along with the label adjacent to the left-parenthesis, represents a constituent of the designated type:

(X ... )
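
If you would like to check your reading of a bracketing mechanically, the following is a minimal sketch in Python, not part of the Stanford Parser or of any assigned tool. The function names (parse, words, constituents) are made up for this illustration, and the sketch assumes that every left-parenthesis is immediately followed by a label. It lists each constituent together with the words it covers.

```python
import re

def parse(text):
    """Turn a labeled bracketing into nested (label, children) tuples.
    Leaves are plain word strings. Hypothetical helper for illustration."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def next_token():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("ran out of input: a right parenthesis is missing")
        token = tokens[pos]
        pos += 1
        return token

    def read():
        token = next_token()
        if token != "(":
            return token                  # a plain word
        label = next_token()              # assumed: a label follows every "("
        children = []
        while pos < len(tokens) and tokens[pos] != ")":
            children.append(read())
        next_token()                      # consume the matching ")"
        return (label, children)

    tree = read()
    if pos != len(tokens):
        raise ValueError("extra material after the first complete bracket")
    return tree

def words(node):
    """Collect the words under a node, left to right."""
    if isinstance(node, str):
        return [node]
    return [w for child in node[1] for w in words(child)]

def constituents(node):
    """Yield (label, covered words) for every bracket pair, outermost first."""
    if isinstance(node, str):
        return
    yield node[0], " ".join(words(node))
    for child in node[1]:
        yield from constituents(child)

example = "(VP (VBG defaulting) (PP (IN on) (NP (NN student) (NNS loans))))"
for label, span in constituents(parse(example)):
    print(f"{label:4} {span}")
```

Run on this fragment of the example above, the loop prints one line per bracket pair, beginning with the VP that covers "defaulting on student loans" and ending with the NNS that covers "loans".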

If you change all the parentheses to square brackets, you can put such labeled bracketings into an online app like phpsyntaxtree, and get a picture like this one.
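
The bracket swap is easy to do by hand, but a few lines of Python will also do it. This is only a sketch: the function name to_square_brackets is invented here, and it assumes that no word in the sentence is itself a literal parenthesis. It also squeezes the multi-line tree onto a single line.

```python
def to_square_brackets(bracketing):
    """Replace ( with [ and ) with ], and collapse whitespace to single spaces.
    Hypothetical helper; assumes no word is itself a parenthesis."""
    one_line = " ".join(bracketing.split())
    return one_line.translate(str.maketrans("()", "[]"))

tree = """(NP (JJ Recent)
    (NNS findings))"""
print(to_square_brackets(tree))  # [NP [JJ Recent] [NNS findings]]
```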

In the example given above, the string

"that the number of students defaulting on student loans has decreased in recent years"

has been analyzed as an "SBAR" (pronounced "ess bar"), which this handy list of "Nonterminals and Preterminals in the Penn Treebank II" will tell us stands for a "Clause introduced by a (possibly empty) subordinating conjunction".

The word "that" is assigned the category "IN", which the same list identifies as a "Preposition or subordinating conjunction"; the string "the number of students defaulting on student loans" is analyzed as an "NP" (= "Noun Phrase"); and so forth.

The full set of parsing principles is complicated -- the manual is 318 pages long -- but the basic idea is a simple one: complex phrases are created by putting simpler ones together.

Such parsers have become quite good -- and the Stanford Parser is probably the best current open-source example -- but they can still make mistakes in remarkably simple cases. Thus

The University of Pennsylvania freshman class.

comes out as

(ROOT
  (NP
    (NP (DT The) (NNP University))
    (PP (IN of)
      (NP (NNP Pennsylvania) (NN freshman) (NN class)))
    (. .)))

This suggests that "Pennsylvania freshman class" is the name of the university.

By trying inputs like "We joined the Penn freshman class" or "We visited the University of Pennsylvania", which seem to work, you should be able to figure out what the correct parse for "The University of Pennsylvania freshman class" should look like. Or you could try a competitor, like the online Berkeley Parser demo, which happens to get this one right.

The crux of the matter, in this case, is that "University of Pennsylvania" and "freshman class" should be constituents, so that the main division (leaving out labels and other details) should not be

(University (of (Pennsylvania freshman class)))

but rather

((University of Pennsylvania) (freshman class))
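
One way to compare two schematic bracketings like these is to list the word strings that each one groups together. The Python sketch below does that for unlabeled bracketings. The function name bracket_spans is invented for this illustration, and the sketch identifies a constituent by its words alone, so it assumes that no word sequence occurs twice in the phrase.

```python
import re

def bracket_spans(schematic):
    """Return the set of word strings enclosed by each pair of parentheses.
    Hypothetical helper; works on unlabeled bracketings only."""
    words, open_positions, spans = [], [], set()
    for token in re.findall(r"\(|\)|[^\s()]+", schematic):
        if token == "(":
            open_positions.append(len(words))   # where this bracket starts
        elif token == ")":
            if not open_positions:
                raise ValueError("unmatched right parenthesis")
            start = open_positions.pop()
            spans.add(" ".join(words[start:]))
        else:
            words.append(token)
    if open_positions:
        raise ValueError("unmatched left parenthesis")
    return spans

parser_output = "(University (of (Pennsylvania freshman class)))"
intended = "((University of Pennsylvania) (freshman class))"

for candidate in ("University of Pennsylvania", "freshman class",
                  "Pennsylvania freshman class"):
    print(candidate,
          candidate in bracket_spans(parser_output),
          candidate in bracket_spans(intended))
```

Under these assumptions, "University of Pennsylvania" and "freshman class" come out as constituents only in the intended bracketing, while "Pennsylvania freshman class" is a constituent only in the parser's version. The function also raises an error when the parentheses do not match, which is a quick sanity check for bracketings you write by hand.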

For each of the following examples from a recent issue of the DP, describe what is wrong with the Stanford Parser's analysis. Then try to give a similarly schematic form of what the right analysis should be.

(1) Quarterback Alek Torgersen suffered a head injury late in the first half of Penn’s loss to the Big Green at Franklin Field last weekend. 

(2) Liu said that students may see her chocolates at Houston Hall one day.

(3) It can be difficult to separate the true from the semi-true from the downright false.

Again, the Berkeley Parser may be worth comparing -- though there is no guarantee that either system will get any particular sentence right.

You may find it helpful to start with a more skeletal analysis, putting in only a few of the crucial labeled brackets to make sure that you have the basic structure correct. That is the key step -- adding the rest of the (more detailed) structure can come later.