LING 001 -- Homework 6

(Due 10/28/2015)

1. Correcting incorrect parses, again

It's traditional for many LING001 students to have trouble with the basic concepts of syntactic analysis, and especially with interpreting "labelled bracketings" and "tree structures". This year's participants have upheld the tradition energetically. So this homework assignment is intended to lead you towards better understanding of these issues, through a set of simple examples followed by a simple problem.

We'll start with some exposition -- if you've already figured this stuff out, you can skip ahead to the actual questions.

If we give the word sequence "University of California" to the online Berkeley Parser, it tells us, plausibly enough, that the structure is

(ROOT
   (NP
     (NP (NNP University))
     (PP (IN of)
       (NP (NNP California)))))

or, in the equivalent graphical format that most people find easier to read,

This tells us that the whole phrase "University of California" is a Noun Phrase (symbolized "NP"), made up of a smaller NP containing just the Proper Noun "University" and the subsequence "of California", which is a Prepositional Phrase (symbolized "PP"). The PP in turn consists of the Preposition "of" (symbolized "IN") and another NP, which consists of nothing but the word "California", a Proper Noun (symbolized "NNP").
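If it helps to see the bracket notation unpacked mechanically, here is a minimal Python sketch that reads a labelled bracketing into nested lists and prints each labelled node next to the words it covers. The function names (parse_bracketing, words, constituents) are made up for this illustration and are not part of the Berkeley Parser or any other tool. The sketch assumes well-formed input in which every parenthesis is a structural bracket.

```python
import re

def parse_bracketing(text):
    """Parse a Penn Treebank-style labelled bracketing into nested lists.

    Each constituent becomes [label, child, child, ...]; words stay plain strings.
    """
    # Tokens are "(", ")", or any run of characters that is neither space nor bracket.
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def read_node():
        nonlocal pos
        assert tokens[pos] == "(", "a constituent must start with '('"
        pos += 1
        node = [tokens[pos]]          # the category label, e.g. NP
        pos += 1
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                node.append(read_node())   # a nested constituent
            else:
                node.append(tokens[pos])   # a word of the terminal string
                pos += 1
        pos += 1                      # skip the closing ")"
        return node

    return read_node()

def words(node):
    """Return the terminal string covered by a node, as a list of words."""
    if isinstance(node, str):
        return [node]
    result = []
    for child in node[1:]:
        result.extend(words(child))
    return result

def constituents(node):
    """Yield (label, phrase) for every labelled node, outermost first."""
    if isinstance(node, str):
        return
    yield node[0], " ".join(words(node))
    for child in node[1:]:
        yield from constituents(child)

example = """(ROOT
   (NP
     (NP (NNP University))
     (PP (IN of)
       (NP (NNP California)))))"""

tree = parse_bracketing(example)
for label, phrase in constituents(tree):
    print(label, "->", phrase)
```

Each printed line pairs a label with the stretch of words it dominates, so the PP line shows "of California" and the outermost NP line shows the whole phrase.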

This is a trivial example of the principle of compositionality discussed in the lecture notes: "Language is intricately structured, and linguistic messages are interpreted, layer by layer, in a way that depends on their structure". (Well, OK, there's nothing here yet about the layered interpretation -- but at least we've started to define the layers...)

The details of syntactic nomenclature in this example come from the conventions of the Penn Treebank, which specifies a set of Part-of-Speech Tags (labels for the lexical categories of individual words) and also a set of "non-terminal" (= higher-level) labels for things like Noun Phrase, Verb Phrase, Prepositional Phrase, Sentence, and so on.

The general outlines of this system are entirely traditional, with roots hundreds of years old. But the details involve some arbitrary choices -- thus there's a different tagset ("CLAWS") created at the University of Lancaster, which in some cases just uses different abbreviations (thus "NP1" instead of "NNP" for singular proper nouns, and "NP2" instead of "NNPS" for plural proper nouns), and in other cases makes a slightly different set of distinctions. And there are some similar semi-arbitrary choices to be made about what the tree structures should be in particular classes of cases.

The Penn Treebank conventions are the most widely-used set of ideas about such things, and also the best documented, both by means of instruction manuals and also by means of large-scale collections ("corpora") of parsed text -- which is what the word "treebank" means.

In this example, we could substitute Illinois, Washington, Cambridge, Pennsylvania, etc. in place of California, and the structure would remain otherwise the same. And if we turn all the parentheses into square brackets, and paste the result into an online tool like phpSyntaxTree, we get

[ROOT
   [NP
     [NP [NNP University]]
     [PP [IN of]
       [NP [NNP California]]]]]

...where the diagram obviously represents the same information, although it has added some colors (distinguishing the "terminal string" in red, i.e. the original word sequence, and the syntactic structure in blue), and some numerical subscripts (distinguishing the different instances of the same "non-terminal" category label, e.g. NP1, NP2, NP3).
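The parenthesis-to-bracket conversion described above can be done by hand, or with a one-line Python sketch like the following. The function name to_square_brackets is invented for this illustration. It assumes that every parenthesis in the input is a structural bracket, which holds for Penn Treebank-style output because literal parentheses in the text are written as -LRB- and -RRB-.

```python
def to_square_brackets(bracketing):
    """Swap ( ) for [ ] so the bracketing can be pasted into a tree-drawing tool."""
    return bracketing.translate(str.maketrans("()", "[]"))

parse = """(ROOT
   (NP
     (NP (NNP University))
     (PP (IN of)
       (NP (NNP California)))))"""

print(to_square_brackets(parse))
```

Labels, words, and indentation are left untouched; only the bracket characters change.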

For another trivial example, consider "The history department", which turns out to be another Noun Phrase, this time made up of two Common Nouns (symbolized "NN") preceded by a Determiner:

(ROOT
  (NP (DT The) (NN history) (NN department)))

Again, it should be obvious that we could replace history with biology, sociology, music, mathematics, linguistics, etc., and get the same structure but with one word in the terminal string changed.

Note, though, that the Berkeley parser doesn't know that linguistics is a singular Common Noun ("NN") just like history, and instead guesses that since it ends in 's', it must be a plural form ("NNS"):

(ROOT
  (NP (DT The) (NNS linguistics) (NN department)))

We could use phpSyntaxTree again to create a diagram of the correct analysis, or the similar tool syntree:

[ROOT
     [NP [DT The] [NN linguistics] [NN department]]]

And for something a little less trivial, let's put both of our simple phrases together, to form a slightly more complex Noun Phrase "The history department of the University of California":

(ROOT
  (NP
    (NP (DT The) (NN history) (NN department))
    (PP (IN of)
      (NP
        (NP (DT the) (NNP University))
        (PP (IN of)
          (NP (NNP California)))))))

We've joined the two structures by

OK, that should be enough explanation for you to see how to answer Question 1 of the assignment, given below.

Consider the two sentences below, each followed by the analysis assigned to it by the online version of the Berkeley parser:

(A) The mayor of Philadelphia's name is Michael Nutter

(ROOT
  (S
    (NP
      (NP (DT The) (NN mayor))
      (PP (IN of)
        (NP
          (NP (NNP Philadelphia) (POS 's))
          (NN name))))
    (VP (VBZ is)
      (NP (NNP Michael) (NNP Nutter)))))

(B) The name of Philadelphia's mayor is Michael Nutter

(ROOT
  (S
    (NP
      (NP (DT The) (NN name))
      (PP (IN of)
        (NP
          (NP (NNP Philadelphia) (POS 's))
          (NN mayor))))
    (VP (VBZ is)
      (NP (NNP Michael) (NNP Nutter)))))

Question 1(a): Note that the parser has assigned essentially the same structure to the two sentences (A) and (B), changing only the words in the terminal string. In one case, this structure correctly represents the normal interpretation of the phrase, while in the other case, the parser got it wrong. Which one is right and which one is wrong?

Question 1(b): For the sentence that the parser analyzed incorrectly, what structure should it have assigned? Present your answer as a labelled bracketing and also as a tree diagram, using phpSyntaxTree or syntree or a similar application.

2. Analysis of Political Speeches

Five types of analysis are listed below. Apply two of them to one political speech, or one of them to political speeches by two different politicians. (You can do more if you have the interest and time -- and you might find the seed for a term project somewhere in this exercise.)

1. Use of metaphorical language. Try to relate a number of particular examples to a larger metaphorical scheme, perhaps in the general style discussed here and here.

2. Sentence length and depth of syntactic embedding. See here and here for some sample analyses in this style.

3. Use of disfluencies (filled pauses such as um and uh, or notable silent pauses; repetitions such as "I see no reason to believe we're headed for -- (pause) -- for economic downturn"; self-corrections such as "I mean, the free market is our -- one of our greatest assets"). At what rate do disfluencies occur? What is the effect on the candidate's message?

4. The dimensions of James Pennebaker's "LIWC" analysis. (Since this has not been covered in class, if you choose this option, you'll need to read at least the basic online description, and perhaps some of the papers written about it. Note that a free online version is available -- we are not asking you to buy the program.)

5. You can use free pre-packaged software like TextSTAT to do some word- or phrase-frequency calculations of your own, whose results you think are interesting. If you're a little more ambitious, and especially if you have some programming background, you might try NLTK. (As one trivial example, comparing the Oct. 9, 2007 Republican debate to the Sept. 26, 2007 Democratic debate, the Republicans used the word "no" at an overall rate of about 2.25 per thousand words, whereas the Democratic "no" rate was about 3.63 per thousand words.)
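For analysis type 5, a rate "per thousand words" like the one quoted above is just 1000 times the count of the target word divided by the total number of words. Here is a minimal Python sketch using only the standard library. The function name and the sample sentence are invented for illustration (the sentence is not taken from any debate transcript), and the tokenization rule (lowercased runs of letters and apostrophes) is an assumption that you may want to change.

```python
import re

def rate_per_thousand(text, target):
    """Occurrences of `target` per 1000 word tokens in `text` (case-insensitive)."""
    # Crude tokenization: lowercase runs of letters and apostrophes count as words.
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    return 1000 * tokens.count(target.lower()) / len(tokens)

# Invented sample text, only to show the calculation.
sample = "No, I do not agree. There is no reason to say no to that."
print(round(rate_per_thousand(sample, "no"), 1))
```

A 14-word sample like this gives a meaninglessly high rate; on a full transcript, the same function yields figures on the scale of those quoted above. Whatever tokenization you choose, document it, since different choices change the denominator.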

[Note that the on-line transcripts generally do not reproduce disfluencies accurately, so if you choose to do analysis type 3, you should correct the transcripts by reference to the recordings.]
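For analysis type 2, one crude way to put a number on depth of syntactic embedding is to take a parser's labelled bracketing and find the deepest level of bracket nesting. The Python sketch below does that. The function name max_depth is invented here, and this particular measure (which counts the ROOT node and the part-of-speech level) is an assumption made for illustration, not necessarily the measure used in the sample analyses mentioned in the list.

```python
def max_depth(bracketing):
    """Deepest level of bracket nesting in a labelled bracketing."""
    depth = 0
    deepest = 0
    for ch in bracketing:
        if ch == "(":
            depth += 1
            deepest = max(deepest, depth)
        elif ch == ")":
            depth -= 1
    return deepest

simple = "(ROOT (NP (DT The) (NN history) (NN department)))"
nested = """(ROOT
  (NP
    (NP (DT The) (NN history) (NN department))
    (PP (IN of)
      (NP
        (NP (DT the) (NNP University))
        (PP (IN of)
          (NP (NNP California)))))))"""

print(max_depth(simple), max_depth(nested))
```

The flat phrase "The history department" has a smaller maximum depth than "The history department of the University of California", because each added Prepositional Phrase nests another NP inside it. Averaging this number over the sentences of a speech, alongside sentence length in words, gives two simple figures to compare across speakers.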

Feel free to improvise on these general ideas, as long as the results are a) reasonably objective, factually correct, and well documented, and b) interesting.

Whatever analyses you do, try to use them to draw some conclusions about the politicians and their personalities, ideas, and rhetorical self-presentation.

You're free to choose any politicians, from any country, speaking in any language. Some perhaps-useful links are given below:

This site has links to many sources of (mostly historical) political speeches.

For recent speeches by President Obama, see the "Photos & Video" section at whitehouse.gov, which has .mp3 audio and transcripts as well as streaming and downloadable video.

GOP.gov has links to videos of many addresses, interviews and press conferences by Republican politicians, though generally without transcripts.

C-SPAN.org has links to video of many political speeches, events, and discussions of all kinds.

You can find links to transcripts and video or audio recordings in the Wikipedia pages for the "Republican presidential debates, 2008" or "Republican presidential debates, 2016", and the "Democratic presidential debates, 2008" or the "Democratic presidential debates, 2016".

You can find video, audio and transcripts for UK parliamentary activities here and here.

And you can often find audio, video, and/or transcripts of political events by doing web searches of more specific kinds, like {Donald Trump transcript mp3} or {Hillary Clinton transcript mp3}.