Linguistics 001: Homework 8

Due Monday, 11/14/2005

This homework is intended to exercise your skills in linguistic analysis at various levels, in preparation for your final project. You can collaborate in small groups (three or fewer) on this assignment, but make sure that everyone involved understands how to do each part of the analysis.

(1) A link to a transcript of President George W. Bush's address nominating Samuel A. Alito as Supreme Court justice is here. That page has a link to streaming video, or you can download a .wav file here.

How many words are there in the transcript of Bush's nomination speech? How many words are there in the transcript of Alito's subsequent remarks? (If you cut and paste the text into a program like Microsoft Word, you can use the built-in "word count" function, available in MS Word under the "Tools" menu, to do the word count for you.)
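If you would rather count words with a short script than with a word processor, the following Python sketch shows one way to do it. It assumes that a "word" is any run of characters separated by whitespace, which is roughly what a word processor does; the function name count_words is just an illustrative choice. Note that a free-standing dash such as "--" counts as a word under this definition, so your total may differ slightly from MS Word's.

```python
def count_words(text: str) -> int:
    """Count whitespace-separated tokens (an assumed definition of 'word')."""
    return len(text.split())


# The opening of Bush's speech, used here only as a short test string.
sample = "Good morning. I'm pleased to announce my nomination of Judge Samuel A. Alito, Jr."
print(count_words(sample))  # 14
```

To count a whole transcript, paste it into a plain text file and pass the file's contents to count_words.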

(2) How many total seconds does Bush's speech take? How about Alito's remarks?

You should use a program for audio display and playback in order to make accurate measurements. Two relatively easy (and free) programs that you can use to do this are WaveSurfer and Audacity. You can also use the free Transcriber program, which makes it easy to enter a time-aligned transcription.

(3) Based on the total number of words and the total elapsed time, what is the overall speech rate (measured in Words Per Minute = WPM) for each speaker?
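The arithmetic is simply the word count divided by the duration in minutes. As a minimal sketch (the function name and the numbers in the example are made up for illustration, not measurements from either speech):

```python
def words_per_minute(word_count: int, duration_seconds: float) -> float:
    """Speech rate in WPM: words divided by elapsed time in minutes."""
    if duration_seconds <= 0:
        raise ValueError("duration_seconds must be positive")
    return word_count / (duration_seconds / 60.0)


# Hypothetical values: 300 words spoken in 150 seconds (2.5 minutes).
print(words_per_minute(300, 150.0))  # 120.0
```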

(4) On June 2, 2005, President Bush visited Hopkinsville, KY, to promote his plan for revising social security. The transcript of the session is here. Again, streaming video is available via a link on the whitehouse.gov page. A short audio excerpt is available as a .wav file here, corresponding to the stretch in the transcript from "The pay-as-you-go system is -- really isn't fair" to "That's not very far down the road."

How many words are there in this passage? How long is the corresponding audio? What is the speech rate in WPM for this stretch?

(5) Go back to Bush's nomination speech for Alito, and prepare a version of it that notes the duration of silent pauses between spoken phrases. You can use whatever format you prefer, but the content should be something like I've exemplified for the first few sentences of his speech:

Good morning. [1.429]
I'm pleased to announce my nomination of Judge Samuel A. Alito, Jr., [0.756]
as Associate Justice of the Supreme Court of the United States. [1.895]
Judge Alito is one of the most accomplished [0.622]
and respected judges in America, [1.869]
and his long career in public service has given him an extraordinary breadth [1.14]
of experience. [2.240]
As a Justice Department official, [0.843]
federal prosecutor [0.498] and judge on the United States Court of Appeals, [1.034]
Sam Alito has shown a mastery of the law, [1.101]
a deep commitment of justice, [2.573]
and a- [0.670]
man- and he is a man of enormous character. [1.338]
He's scholarly, fair-minded and principled, [1.195]
and these qualities will serve our nation well on the highest court [0.584]
of the land. [1.934]

You can use any program that allows you to make careful audio time measurements, but for this purpose I recommend Transcriber.

If you subtract the duration of the silent pauses from the elapsed time, what is the effective speech rate during what is left (i.e. the actually vocalized phrases of the speech)?
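If your pause-marked transcript follows the bracketed format shown above, a short script can add up the pauses for you. The Python sketch below is one possible approach, not a required method. It assumes every pause is written as a number of seconds in square brackets, and that everything else separated by whitespace is a word (so fragments like "a-" and "man-" are counted as words). The elapsed time of 14.08 seconds in the example is invented for illustration; you need to measure the real value yourself.

```python
import re

# A pause mark is a number of seconds in square brackets, e.g. [1.429].
PAUSE = re.compile(r"\[(\d+(?:\.\d+)?)\]")


def pause_durations(transcript):
    """Return every bracketed pause duration, in order, as floats."""
    return [float(d) for d in PAUSE.findall(transcript)]


def spoken_words(transcript):
    """Count whitespace-separated words after removing the pause marks."""
    return len(PAUSE.sub(" ", transcript).split())


def effective_wpm(transcript, elapsed_seconds):
    """WPM over the vocalized stretches only (elapsed time minus pauses)."""
    voiced = elapsed_seconds - sum(pause_durations(transcript))
    if voiced <= 0:
        raise ValueError("pauses add up to at least the elapsed time")
    return spoken_words(transcript) / voiced * 60.0


excerpt = """Good morning. [1.429]
I'm pleased to announce my nomination of Judge Samuel A. Alito, Jr., [0.756]
as Associate Justice of the Supreme Court of the United States. [1.895]"""

# 14.08 s is a made-up elapsed time, chosen so the arithmetic is easy to check:
# 25 words over 14.08 - 4.08 = 10 s of actual speech.
print(round(effective_wpm(excerpt, 14.08), 1))  # 150.0
```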

(6) Separate the pauses between sentences from the pauses within sentences (using the sentence divisions given by the official transcript). Plot a histogram (or other summary plot) of the within-sentence vs. across-sentence pauses. (You can create such plots using Microsoft Excel, or the free-software statistics program R (where the function is called hist()), or nearly any other statistical software package; or you can do it by hand.)
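As a rough sketch of how the separation could be automated for a transcript in the bracketed format from (5), the Python below treats a pause as across-sentence when the text just before it ends in ".", "?" or "!", and as within-sentence otherwise. That punctuation rule is an assumption made for this example; it will misfire on cases such as a closing quotation mark after the period, so check the results against the official transcript's sentence divisions. The text histogram and its 0.5-second bin width are likewise just one simple choice.

```python
import re
from collections import Counter

# Each match is (text before the pause, pause duration in seconds).
CHUNK = re.compile(r"([^\[\]]*)\[(\d+(?:\.\d+)?)\]")


def split_pauses(transcript):
    """Return (within_sentence, across_sentence) lists of pause durations."""
    within, across = [], []
    for text, duration in CHUNK.findall(transcript):
        ends_sentence = text.strip().endswith((".", "?", "!"))
        (across if ends_sentence else within).append(float(duration))
    return within, across


def text_histogram(values, bin_width=0.5):
    """Draw a plain-text histogram, one row per bin of bin_width seconds."""
    if not values:
        return ""
    counts = Counter(int(v // bin_width) for v in values)
    rows = []
    for b in range(max(counts) + 1):
        low = b * bin_width
        rows.append(f"{low:.1f}-{low + bin_width:.1f} s | " + "#" * counts[b])
    return "\n".join(rows)


excerpt = """Good morning. [1.429]
I'm pleased to announce my nomination of Judge Samuel A. Alito, Jr., [0.756]
as Associate Justice of the Supreme Court of the United States. [1.895]
Judge Alito is one of the most accomplished [0.622]
and respected judges in America, [1.869]
and his long career in public service has given him an extraordinary breadth [1.14]
of experience. [2.240]"""

within, across = split_pauses(excerpt)
print("Within-sentence pauses:")
print(text_histogram(within))
print("Across-sentence pauses:")
print(text_histogram(across))
```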

(7) Go to LDC Online, and sign up for a guest account. Once you are able to log in as a guest, go to "LDC Corpus Search". Select "English Conversations" from the menu on the top left (which starts out reading "English News Text"). Set the "results view" to "tabular".

Now try the search string

"the" sex:male

The summary line should read:

Your search for "the" sex:male returned 54615 hits in 2478 documents .

If you try the search string

"the" sex:female

the results summary should be

Your search for "the" sex:female returned 43018 hits in 2388 documents .

These searches run over the Switchboard corpus of around 2,400 telephone conversations.

A few facts about this corpus and the search system:

"Document" here means one side of a conversation, so that there are about 4,800 "documents".
Search strings match words separated by spaces, start and end of lines, and punctuation (including hyphens).
The assenting murmur sometimes written "mm hmm" is transcribed as "um-hum" in this corpus; the word sometimes written "OK" is (usually) transcribed "okay".
Each side of a conversation is searched as a separate sequence, so that the string that is returned may span (and omit) contributions on the other side of the conversation.
In order to get hit counts, you need to surround your search word in quotation marks, as in the examples given above.

Now search this corpus for the following, distinguishing male and female counts:

"okay"
"yes"
"yeah"
"right"
"uh-huh"
"um-hum"
"no"
"so"
"uh" (note that you will need to subtract the counts for "uh-huh")
"um" (note that you will need to subtract the counts for "um-hum")
"marvelous"

Briefly interpret what you find in terms of the ideas about language and gender sketched in the course lecture notes.
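Since the male and female sides of the corpus do not contain the same amount of speech (compare the two hit counts for "the" above), raw counts are hard to compare directly. One possible way to put them on a common footing, sketched here in Python, is to express each word's count per thousand occurrences of "the" for the same sex. This normalization is a suggestion, not a required method, and the function names are invented for the example. The "uh" and "uh-huh" numbers below are made-up placeholders, not real search results; only the two "the" counts come from the summaries quoted above.

```python
# Hit counts for "the" as reported in the search summaries above.
THE_HITS = {"male": 54615, "female": 43018}


def adjusted_hits(raw_hits, hyphenated_hits):
    """Remove hits that really belong to a hyphenated form.

    For example, each "uh-huh" also matches a search for "uh" once,
    because hyphens count as word separators in the search system.
    """
    return raw_hits - hyphenated_hits


def per_thousand_the(hits, sex):
    """Express a hit count per 1,000 occurrences of "the" for that sex."""
    return hits / THE_HITS[sex] * 1000.0


# Placeholder counts for illustration only; substitute your own results.
uh_male = adjusted_hits(raw_hits=1000, hyphenated_hits=250)
print(uh_male)
print(per_thousand_the(uh_male, "male"))
```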

Extra Credit (do any subset of these or similar things):

Measure the duration of silent pauses within and between sentences in a longer stretch of one of George Bush's informal or conversational interactions, such as the Social Security conversation in Hopkinsville. (To do this, you'll need to make your own digital audio recording -- on most machines, you can use WaveSurfer or Audacity to make a .wav recording while playing the streaming media from the web. Instructions for doing this in WaveSurfer can be found here.) Then compare the pattern in formal, read speech (like the Alito nomination) with the pattern in the more informal, conversational material.
In your pause-marked transcription of the Alito nomination speech (or other material), can you find additional features (besides sentence boundaries) that seem to correlate with differences in pause duration? For example, you could compare pauses preceding clausal conjunctions ("and his long career in public service has given him an extraordinary breadth of experience") vs. non-clausal conjunctions ("and respected judges in America").
See if you can find evidence in the Switchboard data that bears on the "tag question" problem discussed in the lecture notes. Note that to do this, you'll need to look at the whole transcript -- the transcript for each conversation can be accessed from the tabular "concordance" view by clicking on the Document ID at the left of the line.
Based on the Switchboard data, what is the difference in discourse function between the "filled pauses" transcribed as um and uh?
