Linguistics 520: Diving into R

1. Getting Started

Download http://babel.ling.upenn.edu/courses/ling520/SwitchboardDurations to a folder where R will find it, i.e. your working directory (remember setwd()?). You can do this from within R via

download.file("http://ling.upenn.edu/courses/ling520/SwitchboardDurations.rda","SwitchboardDurations.rda")

Then in R, execute

load("SwitchboardDurations.rda")
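To see what load() actually restored, it can help to capture its return value, which is a character vector naming the objects it created. Here is a minimal sketch using a throwaway file and made-up objects (toy_durs and toy_ids are hypothetical stand-ins, not the course data):

```r
# Toy stand-ins for the real objects; names and values are made up.
toy_durs <- data.frame(Word = c("hi", "um"), Duration = c(0.51, 0.65))
toy_ids  <- c("2001a", "2001b")

# Save both objects to a temporary .rda file, then remove them.
rda_path <- tempfile(fileext = ".rda")
save(toy_durs, toy_ids, file = rda_path)
rm(toy_durs, toy_ids)

# load() restores the objects and (invisibly) returns their names.
restored <- load(rda_path)
print(restored)
print(dim(toy_durs))   # number of rows and columns
str(toy_durs)          # column types at a glance
```

With the course file, the same pattern would be restored = load("SwitchboardDurations.rda"), followed by dim(WordDurs) or head(WordDurs).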

This will give you several variables, crucially WordDurs, Call2Caller, and CallerInfo, created from the Mississippi State revisions of the aligned Switchboard transcripts. In a later exercise, we'll look at the process of generating this R-accessible form of the data.

WordDurs has 4,051,206 rows and 8 columns, like this:

CallPos   Call  PhrasePos  BPhrasePos  PLength  Duration       Word  Nonlex
      1  2001a          0           0        0  0.977625  [silence]       1
      2  2001a          0           0        0  0.237625  [silence]       1
      3  2001a          1           1        1  0.509375         hi       0
      4  2001a          0           0        0  0.549000  [silence]       1
      5  2001a          1           1        1  0.654000         um       0
      6  2001a          0           0        0  0.293875  [silence]       1
      7  2001a          1           6        6  0.440250       yeah       0
      8  2001a          2           5        6  0.295875        i'd       0
      9  2001a          3           4        6  0.150000       like       0
     10  2001a          4           3        6  0.160000         to       0
     11  2001a          5           2        6  0.260000       talk       0

The columns are:

CallPos -- position in the current side of the current call
Call -- ID of call (number) and side (a or b)
PhrasePos -- Count from the start of the phrase, where "phrase" is "sequence between silences"
BPhrasePos -- Count from the end of the phrase
PLength -- Length of the current phrase
Duration -- in seconds
Word -- The current word, in standard English spelling, or [silence], [noise], [laughter] etc.
Nonlex -- 1 if "Word" is [silence], [noise], or other non-speech material; 0 for "real speech"
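In the sample rows above, the three position columns of the speech rows appear to be related by BPhrasePos = PLength - PhrasePos + 1, while the non-speech rows carry 0 in all three. The sketch below checks this on a few rows copied from the sample. The data frame name sample_rows is made up for illustration, and the relation is inferred from the sample rather than guaranteed for the full data set:

```r
# A few rows copied from the sample table above.
sample_rows <- data.frame(
  PhrasePos  = c(0, 1, 1, 2, 3, 4, 5),
  BPhrasePos = c(0, 1, 6, 5, 4, 3, 2),
  PLength    = c(0, 1, 6, 6, 6, 6, 6),
  Word       = c("[silence]", "hi", "yeah", "i'd", "like", "to", "talk"),
  Nonlex     = c(1, 0, 0, 0, 0, 0, 0),
  stringsAsFactors = FALSE
)

# Restrict the check to real speech; silences have 0 in all three columns.
is_speech     <- sample_rows$Nonlex == 0
expected_back <- with(sample_rows, PLength - PhrasePos + 1)
relation_holds <- all(sample_rows$BPhrasePos[is_speech] == expected_back[is_speech])
print(relation_holds)
```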

With these columns you can, for example, compare the mean durations of "um" and "uh":

whichum = (WordDurs[,"Word"] == "um")
whichuh = (WordDurs[,"Word"] == "uh")
meanUM = mean(WordDurs[whichum,"Duration"])
meanUH = mean(WordDurs[whichuh, "Duration"])
cat(sprintf("Mean UM Duration %.3f (of %d)\nMean UH Duration %.3f (of %d)\n", meanUM, sum(whichum),meanUH, sum(whichuh)))

...giving you this result:

Mean UM Duration 0.431 (of 21187)
Mean UH Duration 0.306 (of 69814)

Call2Caller has 4876 rows like this:

          CallID CallerID
   4630b  4630b     1690
   4718a  4718a     1690
   4825b  4825b     1690
   4812b  4812b     1691
   4629b  4629b     1650
   4703b  4703b     1650
   4618b  4618b     1651
   4655b  4655b     1651
   4216a  4216a     1610
   4241a  4241a     1610

CallerInfo has 543 rows like this:

        ID  Sex  Age  Calls         Region
1000  1000    F   37      1  SOUTH_MIDLAND
1001  1001    M   51      3        WESTERN
1002  1002    F   28      2       SOUTHERN
1003  1003    M   44      2  NORTH_MIDLAND
1004  1004    F   33      2       NORTHERN
1005  1005    F   35      2        WESTERN
1007  1007    F   26      2    NEW_ENGLAND
1008  1008    F   52      1          MIXED
1010  1010    M   59      1    NEW_ENGLAND
1011  1011    F   27      2  SOUTH_MIDLAND

The other variables include 4,051,206-element vectors (one element per row of WordDurs) derived from these tables for convenience, e.g.

Calls = WordDurs[,"Call"]
Callers = Call2Caller[Calls,2]
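The second line relies on Call2Caller having call IDs as its row names, so that indexing by a vector of call IDs looks up one caller per word. The same trick can be chained to pull speaker attributes out of CallerInfo. Here is a small sketch on toy tables: all toy_ names are hypothetical, and the call-to-caller pairings are invented for illustration, since the sample rows above do not overlap.

```r
# Toy tables; the pairings of calls with callers are invented.
toy_call2caller <- data.frame(
  CallID    = c("2001a", "2001b"),
  CallerID  = c(1000L, 1001L),
  row.names = c("2001a", "2001b"),
  stringsAsFactors = FALSE
)
toy_callerinfo <- data.frame(
  ID        = c(1000L, 1001L),
  Sex       = c("F", "M"),
  row.names = c("1000", "1001"),
  stringsAsFactors = FALSE
)
toy_calls <- c("2001a", "2001a", "2001b", "2001a")  # one entry per word

# as.character() guards against factor inputs: R indexes by a factor's
# integer codes, not by its labels, which would silently pick wrong rows.
toy_callers <- toy_call2caller[as.character(toy_calls), "CallerID"]
toy_sex     <- toy_callerinfo[as.character(toy_callers), "Sex"]
print(toy_callers)
print(toy_sex)
```

Whether Calls is a factor or a character vector in the loaded data is worth checking with class(Calls) before relying on this kind of lookup.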

You may find it worthwhile to define some other similar vectors, e.g.

Speech = !WordDurs[,"Nonlex"]

... for reasons that will become clear below.

2. A simple plot

Let's look at mean word duration as a function of phrase length:

PLength = WordDurs[,"PLength"]
MeanDurs = vector(mode="numeric", length=15)
for(n in 1:15){
   which = ((PLength==n) & Speech)
   MeanDurs[n] = mean(WordDurs[which,"Duration"])
}
plot(1:15,MeanDurs, xlab="Number of words in the pause group",
   type="b", col="red",
   ylab="Mean Duration per Word (seconds)",
   main="Data from Switchboard -- 3,072,313 words")
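The loop above can also be written without explicit iteration using tapply(), which applies a function within groups. Here is a sketch on made-up numbers (toy_words and the other toy_ names are hypothetical), offered as an alternative formulation rather than a replacement for the code above:

```r
# Made-up durations: two one-word phrases, one two-word phrase
# spoken twice, and one silence.
toy_words <- data.frame(
  PLength  = c(1, 1, 2, 2, 2, 2, 0),
  Duration = c(0.50, 0.70, 0.30, 0.20, 0.40, 0.30, 0.90),
  Nonlex   = c(0, 0, 0, 0, 0, 0, 1)
)
toy_speech <- toy_words$Nonlex == 0

# Mean duration of speech tokens, grouped by phrase length.
toy_means <- tapply(toy_words$Duration[toy_speech],
                    toy_words$PLength[toy_speech],
                    mean)
print(toy_means)
```

The result is named by phrase length, so as.numeric(names(toy_means)) recovers the x-axis values for a plot.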

This yields a plot of mean duration per word (in seconds) against the number of words in the pause group.

3. R assignment #1 -- Due Monday 11/10/2014:

(1) Calculate the mean word duration by position for phrase lengths from 1 to 12. If everything works out, this should give you numbers and plots like those in "The shape of a spoken phrase", 4/12/2006.

(2) Do speaking rates change in the course of a conversation? Devise one or more ways to evaluate this question, and implement at least one of them.

If you don't know how to do something, try Googling it (e.g. cumulative sum in R, which will allow you to turn a vector of durations into a vector of end/start times), or reading the first few chapters of this tutorial.
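As an illustration of the cumulative-sum hint, the sketch below turns the first four durations from the sample table into end and start times. It assumes that each token starts exactly where the previous one ends and that the first token starts at time 0. Both are assumptions made for the example, not facts about the transcripts.

```r
# First four durations from the sample table, in seconds.
durations <- c(0.977625, 0.237625, 0.509375, 0.549000)

# End time of each token is the running total of durations;
# start time is the end time minus the token's own duration.
end_times   <- cumsum(durations)
start_times <- end_times - durations

print(data.frame(start_times, durations, end_times))
```

In the real data the running total would need to restart for each call side, which is one of the things left to work out in the assignment.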