LING 001 -- Homework 5

(Due 10/31/2016)

1. Correcting incorrect parses

It's traditional for many LING001 students to have trouble with the basic concepts of syntactic analysis, and especially with interpreting "labelled bracketings" and "tree structures". So this homework assignment is intended to lead you towards a better understanding of these issues, through a set of simple examples followed by some slightly less simple problems.

We'll start with some exposition -- if you've already figured this stuff out, you can skip ahead to the actual questions.

If we give the word sequence "University of California" to the online Berkeley Parser, it tells us, plausibly enough, that the structure is

(ROOT
   (NP
     (NP (NNP University))
     (PP (IN of)
       (NP (NNP California)))))

or, in the equivalent graphical format that most people find easier to read,

This tells us that the whole phrase University of California is a Noun Phrase (symbolized "NP"), and that the subsequence of California is a Prepositional Phrase (symbolized "PP"), which in turn consists of a Preposition of (symbolized "IN"), and another NP which consists of nothing but the word California, which is a Proper Noun (symbolized "NNP").
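If it helps to see the same reading done mechanically, here is a minimal sketch in Python that turns a labelled bracketing into nested lists and then prints every node with the words it covers. The function names (parse, words, constituents) are made up for this illustration and are not part of any parser's software. The sketch also assumes well-formed input with balanced parentheses.

```python
import re

def parse(bracketing):
    """Turn a labelled bracketing into nested lists: [label, child, child, ...]."""
    tokens = re.findall(r"[()]|[^\s()]+", bracketing)
    pos = 0

    def node():
        nonlocal pos
        pos += 1                          # skip "("
        tree = [tokens[pos]]              # the category label, e.g. "NP"
        pos += 1
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                tree.append(node())       # a sub-phrase
            else:
                tree.append(tokens[pos])  # a word of the terminal string
                pos += 1
        pos += 1                          # skip ")"
        return tree

    return node()

def words(tree):
    """The words covered by a node, as a list."""
    out = []
    for child in tree[1:]:
        out.extend(words(child) if isinstance(child, list) else [child])
    return out

def constituents(tree):
    """Yield (label, phrase) for every node, top-down and left to right."""
    yield tree[0], " ".join(words(tree))
    for child in tree[1:]:
        if isinstance(child, list):
            yield from constituents(child)

example = """
(ROOT
   (NP
     (NP (NNP University))
     (PP (IN of)
       (NP (NNP California)))))
"""

for label, phrase in constituents(parse(example)):
    print(f"{label:5} {phrase}")
```

Each printed line pairs a label with the stretch of words it dominates, so "PP" comes out next to "of California" and the innermost "NP" next to "California", matching the description above.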

This is a simple example of a parse represented in terms of the Penn Treebank standard, which includes a set of tags for the "part of speech" of individual words, as well as a set of phrasal category labels.

And this three-element tree structure is a trivial example of the principle of compositionality discussed in the lecture notes: "Language is intricately structured, and linguistic messages are interpreted, layer by layer, in a way that depends on their structure". (Well, OK, there's nothing here yet about the layered interpretation -- but at least we've started to define the layers...)

The details of syntactic nomenclature in this example come from the conventions of the Penn Treebank, which specifies a set of Part-of-Speech Tags (labels for the lexical categories of individual words) and also a set of "non-terminal" (= higher-level) labels for things like Noun Phrase, Verb Phrase, Prepositional Phrase, Sentence, and so on.

The general outlines of this system are entirely traditional, with roots hundreds of years old. But the details involve some arbitrary choices -- thus there's a different tagset ("CLAWS") created at the University of Lancaster, which in some cases just uses different abbreviations (thus "NP1" instead of "NNP" for singular proper nouns, and "NP2" instead of "NNPS" for plural proper nouns), and in other cases makes a slightly different set of distinctions. And there are some similar semi-arbitrary choices to be made about what the tree structures should be in particular classes of cases.

The Penn Treebank conventions are the most widely-used set of ideas about such things, and also the best documented, both by means of instruction manuals and also by means of large-scale collections ("corpora") of parsed text -- which is what the word "treebank" means.

In this example, we could substitute Illinois, Washington, Cambridge, Pennsylvania, etc. in place of California, and the structure would remain otherwise the same. And if we turn all the parentheses to square brackets, and substitute the results into an online tool like phpSyntaxTree, we get

[ROOT
   [NP
     [NP [NNP University]]
     [PP [IN of]
       [NP [NNP California]]]]]

...where the diagram obviously represents the same information, although it has added some colors (distinguishing the "terminal string" in red, i.e. the original word sequence, and the syntactic structure in blue), and some numerical subscripts (distinguishing the different instances of the same "non-terminal" category label, e.g. NP1, NP2, NP3).
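The parenthesis-to-bracket substitution described above is easy to do by hand, but it can also be done in one step. Here is a small sketch, where the function name to_square_brackets is just an illustrative choice. It assumes that every parenthesis in the input is a structural bracket, which holds for Penn Treebank output because literal parentheses in the text are written as -LRB- and -RRB-.

```python
def to_square_brackets(bracketing):
    """Replace each "(" with "[" and each ")" with "]", leaving everything else alone."""
    return bracketing.translate(str.maketrans("()", "[]"))

parse_output = """(ROOT
   (NP
     (NP (NNP University))
     (PP (IN of)
       (NP (NNP California)))))"""

print(to_square_brackets(parse_output))
```

The labels, words, and indentation come through unchanged, so the result can be pasted directly into a tool that expects square brackets.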

For another trivial example, consider "The history department", which turns out to be another Noun Phrase, this time made up of two Common Nouns (symbolized "NN") preceded by a Determiner:

(ROOT
  (NP (DT The) (NN history) (NN department)))

Again, it should be obvious that we could replace history with biology, sociology, music, mathematics, linguistics, etc., and get the same structure but with one word in the terminal string changed.

Note, though, that the Berkeley parser doesn't know that linguistics is a singular Common Noun ("NN") just like history, and instead guesses that since it ends in 's', it must be a plural form ("NNS"):

(ROOT
  (NP (DT The) (NNS linguistics) (NN department)))

We could use phpSyntaxTree again to create a diagram of the correct analysis, or the similar tool syntree:

[ROOT
     [NP [DT The] [NN linguistics] [NN department]]]
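A tagging disagreement like the NNS/NN one above can be spotted mechanically by pulling the (tag, word) pairs out of two bracketings and comparing them position by position. This is a rough sketch: the helper names are hypothetical, and it assumes both bracketings contain the same words in the same order.

```python
import re

def tagged_words(bracketing):
    """Return (tag, word) pairs for the innermost brackets, e.g. (NN history)."""
    return re.findall(r"\(([^\s()]+)\s+([^\s()]+)\)", bracketing)

def tag_differences(first, second):
    """List (word, tag_in_first, tag_in_second) wherever the two analyses disagree."""
    pairs = zip(tagged_words(first), tagged_words(second))
    return [(word, t1, t2) for (t1, word), (t2, _) in pairs if t1 != t2]

parser_output = "(ROOT (NP (DT The) (NNS linguistics) (NN department)))"
corrected = "(ROOT (NP (DT The) (NN linguistics) (NN department)))"

print(tagged_words(parser_output))
print(tag_differences(parser_output, corrected))
```

Only the innermost brackets match the pattern, because a higher-level label like NP is followed by another opening parenthesis rather than by a word.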

And for something a little less trivial, let's put both of our simple phrases together, to form a slightly more complex Noun Phrase "The history department of the University of California":

(ROOT
  (NP
    (NP (DT The) (NN history) (NN department))
    (PP (IN of)
      (NP
        (NP (DT the) (NNP University))
        (PP (IN of)
          (NP (NNP California)))))))

We've joined the two structures by

OK, that should be enough explanation for you to see how to do the actual assignment, given below.

1. Consider the two sentences below, each followed by the analysis assigned to it by the online version of the Berkeley parser:

(A) The mayor of Philadelphia's name is Jim Kenney

(ROOT
  (S
    (NP
      (NP (DT The) (NN mayor))
      (PP (IN of)
        (NP
          (NP (NNP Philadelphia) (POS 's))
          (NN name))))
    (VP (VBZ is)
      (NP (NNP Jim) (NNP Kenney)))))

(B) The name of Philadelphia's mayor is Jim Kenney

(ROOT
  (S
    (NP
      (NP (DT The) (NN name))
      (PP (IN of)
        (NP
          (NP (NNP Philadelphia) (POS 's))
          (NN mayor))))
    (VP (VBZ is)
      (NP (NNP Jim) (NNP Kenney)))))

Question 1(a): Note that the parser has assigned essentially the same structure to the two sentences (A) and (B), changing only the words in the terminal string. In one case, this structure correctly represents the normal interpretation of the phrase, while in the other case, the parser got it wrong. Which one is right and which one is wrong? Why?

Question 1(b): For the sentence that the parser analyzed incorrectly, what structure should it have assigned? Present your answer as a labelled bracketing and also as a tree diagram, using phpSyntaxTree or syntree or a similar application.

HINT: Check out the parser's output for "Kim and Leslie's answer was wrong."
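The observation in Question 1(a) that the two parses share a structure and differ only in their words can be checked mechanically. The sketch below deletes the words from each bracketing, leaving only a "skeleton" of labels, and compares the results. The function names are invented for this illustration. Note that the check says nothing about which analysis is correct; that part is still up to you.

```python
import re

def skeleton(bracketing):
    """Drop the words, keep the labels and brackets, and normalize whitespace."""
    no_words = re.sub(r"\(([^\s()]+)\s+[^\s()]+\)", r"(\1)", bracketing)
    return " ".join(no_words.split())

def terminal_string(bracketing):
    """Recover the word sequence from the innermost brackets."""
    return " ".join(re.findall(r"\([^\s()]+\s+([^\s()]+)\)", bracketing))

parse_a = """
(ROOT
  (S
    (NP
      (NP (DT The) (NN mayor))
      (PP (IN of)
        (NP
          (NP (NNP Philadelphia) (POS 's))
          (NN name))))
    (VP (VBZ is)
      (NP (NNP Jim) (NNP Kenney)))))
"""

parse_b = """
(ROOT
  (S
    (NP
      (NP (DT The) (NN name))
      (PP (IN of)
        (NP
          (NP (NNP Philadelphia) (POS 's))
          (NN mayor))))
    (VP (VBZ is)
      (NP (NNP Jim) (NNP Kenney)))))
"""

print(skeleton(parse_a) == skeleton(parse_b))
print(terminal_string(parse_a))
print(terminal_string(parse_b))
```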

 

2. The Berkeley parser and the Stanford parser both use a version of the Penn Treebank standard.

If we ask the Berkeley parser to analyze the sentence

What happened and who was involved remains unclear.

it gives us this result:

(ROOT
  (S
    (SBAR
      (WHNP (WP What))
      (S
        (VP (VBN happened)
          (CC and)
          (SBAR
            (WHNP (WP who))
            (S
              (VP (VBD was)
                (VP (VBN involved))))))))
    (VP (VBZ remains)
      (ADJP (JJ unclear)))
    (. .)))

which phpSyntaxTree graphs for us as:

Details aside, there's a basic problem that should be obvious to you once you get used to looking at such patterns. The subject of "remains unclear" should be two conjoined question-phrases "what happened" and "who was involved". But the parse shown above doesn't represent this -- instead it suggests that the first element of the and-conjunction is just "happened", and the initial "what" is the subject of the whole sequence "happened and who was involved".

The Stanford parser lucks out and gets the "scope of conjunction" right in this case:

(ROOT
  (S
    (SBAR
      (SBAR
        (WHNP (WP What))
        (S
          (VP (VBD happened))))
      (CC and)
      (SBAR
        (WHNP (WP who))
        (S
          (VP (VBD was)
            (VP (VBN involved))))))
    (VP (VBZ remains)
      (ADJP (JJ unclear)))
    (. .)))
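One way to make the "scope of conjunction" difference concrete is to list which word sequences each parse treats as a constituent, and then ask whether "What happened" is among them. The sketch below does this for the two parses above. The helper names are hypothetical, and identifying a constituent by its word string alone is a simplification that would blur things if the same words occurred twice in a sentence.

```python
import re

def parse(bracketing):
    """Turn a labelled bracketing into nested lists: [label, child, child, ...]."""
    tokens = re.findall(r"[()]|[^\s()]+", bracketing)
    pos = 0

    def node():
        nonlocal pos
        pos += 1                          # skip "("
        tree = [tokens[pos]]              # category label
        pos += 1
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                tree.append(node())       # sub-phrase
            else:
                tree.append(tokens[pos])  # word
                pos += 1
        pos += 1                          # skip ")"
        return tree

    return node()

def words(tree):
    """The words covered by a node, as a list."""
    out = []
    for child in tree[1:]:
        out.extend(words(child) if isinstance(child, list) else [child])
    return out

def phrases(tree):
    """The set of word strings that some node in the tree covers exactly."""
    found = {" ".join(words(tree))}
    for child in tree[1:]:
        if isinstance(child, list):
            found |= phrases(child)
    return found

berkeley = parse("""
(ROOT (S
  (SBAR (WHNP (WP What))
    (S (VP (VBN happened) (CC and)
      (SBAR (WHNP (WP who)) (S (VP (VBD was) (VP (VBN involved))))))))
  (VP (VBZ remains) (ADJP (JJ unclear)))
  (. .)))
""")

stanford = parse("""
(ROOT (S
  (SBAR
    (SBAR (WHNP (WP What)) (S (VP (VBD happened))))
    (CC and)
    (SBAR (WHNP (WP who)) (S (VP (VBD was) (VP (VBN involved))))))
  (VP (VBZ remains) (ADJP (JJ unclear)))
  (. .)))
""")

for name, tree in [("Berkeley", berkeley), ("Stanford", stanford)]:
    print(name, "What happened" in phrases(tree), "who was involved" in phrases(tree))
```

Both parses treat "who was involved" as a unit, but only the Stanford parse also treats "What happened" as one, which is what conjoining the two question-phrases requires.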

Run each of the following five sentences through the two cited online parser demos. With respect to the basic structure, both of the results might be right, or both might be wrong, or one might be right and one wrong.

For each sentence, answer the following three questions:

Thus for the example above, your answer might be:

Stanford is right and Berkeley is wrong. Berkeley analyzes "and" as connecting "happened" and "who was involved", rather than "what happened" and "who was involved".

If both parsers had been wrong in the same way, then you might add

The correct basic structure is

[[[what happened] and [who was involved]] [remains unclear]]

(Note that if the parsers' answers are different, all the options are still there. One might be right and one might be wrong; or both are wrong -- though in different ways; or both might be right, because the sentence is ambiguous and could plausibly be analyzed either way.)

Here is the list of five sentences to check, adapted slightly from current news stories:

(a) The largest private-sector employer is the University of Pittsburgh Medical Center.
(b) The presidential candidates have sketched out two sharply conflicting views of the economy's health and the best ways to accelerate its growth.
(c) The Philippines suspended participation in any joint patrols with the U.S. of the South China Sea.
(d) UKIP has promised a full inquiry into an altercation involving one of its European lawmakers that left him in hospital and threw the party into chaos.
(e) Colombian President Juan Manuel Santos was awarded the Nobel Peace Prize on Friday for his efforts to end a half-century of civil conflict in his nation.