LING521: Exercise 1

Note: This is really a conceptual exercise. By all means try to actually do it -- but the real point is to understand why there are problems, and what kinds of things they are...



The TIMIT dataset includes 10 sentences read by each of 630 speakers. There are two sentences that every speaker reads, 450 sentences read by seven speakers each, and 1890 sentences read by one speaker each (2×630 + 450×7 + 1890×1 = 6300). For background information about this dataset, see the LDC catalogue entry, and/or the original 1993 documentation.

On Harris, the directory /plab/timit1 contains (among other things) the sub-directories words, phones, and TextGrids. Each of those subdirectories contains 6300 files, one for each read sentence.

You will probably want to do this exercise using a copy of the timit1 directory on your own machine, which you can do by fetching /plab/timit1.tgz from Harris.

The format of the words and phones files is simple. Each is a text file whose name concatenates the speaker ID and the sentence ID, with the extension ".wrd" or ".phn", e.g. phones/MZMB0_SA1.phn, words/MZMB0_SA1.wrd.

Each file contains one item per row (word or phonetic segment), with three space-separated fields giving the start time, the end time, and the item identity. For example, a words file begins like this:

2200 5720 she
5720 9960 had
9960 11640 your
11640 16334 dark
16334 23101 suit
26440 27989 in
27989 34120 greasy
34120 39880 wash
39880 44365 water
46322 49538 all
49538 54175 year

and the corresponding phones file begins:

0 2200 h#
2200 4440 sh
4440 5720 iy
5720 6440 hv
6440 9160 ae
9160 9640 dcl
9640 9960 d
9960 10520 y
10520 11640 er
[...etc...]

The times are in waveform samples -- and since the sampling rate is 16000 samples per second, you divide by 16000 to convert them to seconds, so that 2200 samples → 2200/16000 = 0.1375 seconds.
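
In Python, a minimal reader for these files might look like this -- just a sketch (the function name is mine), assuming the three-column format shown above:

#!/usr/bin/env python3
# Read a TIMIT .wrd or .phn file into (start_sec, end_sec, label) tuples,
# converting sample counts to seconds at 16000 samples/sec.

def read_segments(path, rate=16000):
    segments = []
    with open(path) as f:
        for line in f:
            start, end, label = line.split()
            segments.append((int(start) / rate, int(end) / rate, label))
    return segments

# e.g. read_segments("phones/MZMB0_SA1.phn")[1] might be (0.1375, 0.2775, 'sh')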

Depending on your programming background, you may find the tasks below trivial, or they may seem nearly impossible. If you run into (programming) difficulties, ask the instructor or one of the other course participants for help.


Task 1

What are the "phones" used in this dataset? And how many of each are there?

An easy way to do this, using old-fashioned unix commands, would be:

cat phones/*.phn | gawk '{print $3}' | sort | uniq -c | sort -n

You can do it instead in Python or some other language if you want.
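
For instance, a rough Python equivalent of that pipeline might be (a sketch, assuming you run it from the top of your timit1 copy):

#!/usr/bin/env python3
# Count phone labels across all .phn files, like the shell pipeline above.
import glob
from collections import Counter

counts = Counter()
for path in glob.glob("phones/*.phn"):
    with open(path) as f:
        for line in f:
            counts[line.split()[2]] += 1

for phone, n in sorted(counts.items(), key=lambda kv: kv[1]):
    print(n, phone)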

Now calculate the median duration (in seconds) for the phone "eh" (ARPABET for IPA [ɛ]).

An old-fashioned way to start would be

cat phones/*.phn | egrep ' eh$' | gawk '{printf("%f\n",($2-$1)/16000)}'
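
That prints one duration per line but doesn't compute the median. A trivial stdin filter finishes the job -- sketched here in Python ("median" is just a hypothetical script name):

#!/usr/bin/env python3
# Read one number per line from stdin and print the median.
import sys
import statistics

print(statistics.median(float(line) for line in sys.stdin))

Pipe the gawk output above into it to get the answer.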

You could use Rscript to create a "deciles" program like this:

#!/usr/bin/env Rscript
# Read numbers (one per line) from stdin and print their deciles 0.1 ... 0.9
input <- file('stdin', 'r')
X = scan(input, quiet=TRUE)
myprobs = seq(0.1, 0.9, by=0.1)
dX = quantile(X, probs=myprobs)
for(n in 1:9){ cat(sprintf("%.3f ", dX[n])) }
cat("\n")

...which you could use to extend that pipeline into an old-fashioned shell script like this:

#!/bin/bash
# Assumes the "deciles" script above is executable and on your PATH
TIMIT=/plab/timit1
for phone in "aa" "iy" "eh"
do
    echo -n "$phone "
    val=" $phone$"
    cat $TIMIT/phones/*.phn | egrep "$val" |
        gawk '{printf("%f\n",($2-$1)/16000)}' | deciles
done

which produces this output:

aa 0.078 0.091 0.101 0.109 0.117 0.127 0.138 0.151 0.171 
iy 0.054 0.063 0.070 0.076 0.083 0.090 0.099 0.112 0.133 
eh 0.055 0.065 0.072 0.078 0.085 0.092 0.101 0.112 0.131 

Or again you could use some other language(s) of your choice -- but in any case you need to extend it to cover all 61 phone types, in order to make a table of duration deciles (quantiles 0.1 to 0.9) for all the phones. (This should be a table with 61 rows and 10 columns...)
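
Here is one way the whole table might come out in Python -- again just a sketch, run from the top of a timit1 copy:

#!/usr/bin/env python3
# Duration deciles (0.1 to 0.9) for every phone type in the dataset.
import glob
import statistics
from collections import defaultdict

durations = defaultdict(list)   # phone label -> list of durations in seconds
for path in glob.glob("phones/*.phn"):
    with open(path) as f:
        for line in f:
            start, end, phone = line.split()
            durations[phone].append((int(end) - int(start)) / 16000)

for phone in sorted(durations):
    # n=10 gives 9 cut points; method="inclusive" is closest to R's default quantile()
    deciles = statistics.quantiles(durations[phone], n=10, method="inclusive")
    print(phone, " ".join(f"{d:.3f}" for d in deciles))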

Task 2

There are 9663 instances of ARPABET "iy" (= IPA [i]) in this dataset. Some of them are in non-final syllables, like the first syllable of "peanut", while others are in word-final syllables, like the second syllable of "protein". Word position is one of many factors influencing the duration of phones -- others include phonetic context, stress, and phrasal position. There are also effects of individual speakers or speaker features such as age or dialect, effects of speaking style and context, etc. Researchers often want to investigate the effect of various combinations of such factors -- but given the layout of a dataset like TIMIT, even simple interactions can be difficult to explore.

To illustrate this point, your next task is to figure out how to divide the 9663 "iy" segments according to word position: specifically, final-syllable or non-final-syllable. This is tricky, because the phone and word segments are listed in separate tables that are connected only by their time spans; and syllable segmentation is not available, so either you need to impose it, or rely on some hack like "is this the last vowel segment in a word" (which itself is not all that easy to calculate).
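
To make the time-span linkage concrete, here is a sketch of the "last vowel segment in the word" hack in Python. The vowel list is my own guess at the TIMIT vowel labels -- check it against your Task 1 inventory before believing any counts:

#!/usr/bin/env python3
# Classify each "iy" as word-final-syllable or not, using the hack
# "is this the last vowel segment within its word's time span?"
import glob

# An assumed subset of TIMIT vowel labels -- verify against the full phone set!
VOWELS = {"aa", "ae", "ah", "ao", "aw", "ax", "ax-h", "axr", "ay",
          "eh", "er", "ey", "ih", "ix", "iy", "ow", "oy", "uh", "uw", "ux"}

def read_segments(path):
    with open(path) as f:
        return [(int(a), int(b), lab) for a, b, lab in (line.split() for line in f)]

final_count = nonfinal_count = 0
for phn_path in glob.glob("phones/*.phn"):
    wrd_path = phn_path.replace("phones/", "words/").replace(".phn", ".wrd")
    phones = read_segments(phn_path)
    for wstart, wend, word in read_segments(wrd_path):
        # phones whose spans fall inside this word's span
        inside = [(s, e, p) for s, e, p in phones if s >= wstart and e <= wend]
        vowels = [(s, e, p) for s, e, p in inside if p in VOWELS]
        for i, (s, e, p) in enumerate(vowels):
            if p == "iy":
                if i == len(vowels) - 1:
                    final_count += 1      # last vowel segment in the word
                else:
                    nonfinal_count += 1

print("final-syllable iy:", final_count, " non-final:", nonfinal_count)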

To make it even harder, think about running a parser on the sentences involved -- say spacy -- and integrating the resulting information (parts of speech, boundaries of various phrase types as well as words, etc.) into the process.
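
For a taste of what that would involve, here is a sketch that tags one TIMIT prompt with spacy (assuming spacy and its small English model are installed); note that spacy's tokens would still have to be matched up with TIMIT's word segments, which is an alignment problem of its own:

#!/usr/bin/env python3
# Part-of-speech tagging a TIMIT prompt sentence with spacy.
# Assumes: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("She had your dark suit in greasy wash water all year.")

for token in doc:
    print(token.text, token.pos_, token.tag_)

# Phrase boundaries, e.g. noun phrases:
for chunk in doc.noun_chunks:
    print(chunk.text, chunk.start_char, chunk.end_char)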

Then contemplate (or actually figure out) what would be involved in adding information about phrasal position, part-of-speech, stress, speaker age, etc. Relevant additional information can be found in doc/spkrinfo.tbl, doc/spkrsent.tbl, doc/timitdic.tbl.

Task 3

Now re-do tasks 1 and 2 starting from the files in the (Praat) TextGrids directory (and without using any results from those earlier tasks).
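
If it helps to get started, here is a very rough sketch that pulls labeled intervals out of a TextGrid with regular expressions, assuming these files are in Praat's standard long text format (verify that before trusting it):

#!/usr/bin/env python3
# Extract (xmin, xmax, text) intervals from a long-format TextGrid.
# Caveat: this flattens all tiers together; a real parser would split the file
# at the 'item [n]:' headers to keep the word and phone tiers separate.
import re
import sys

interval_re = re.compile(
    r'xmin\s*=\s*([\d.]+)\s*'
    r'xmax\s*=\s*([\d.]+)\s*'
    r'text\s*=\s*"([^"]*)"')

with open(sys.argv[1]) as f:
    for xmin, xmax, text in interval_re.findall(f.read()):
        print(float(xmin), float(xmax), text)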

Have fun, perhaps by imagining what tortures await the inventors of the TextGrid format in the afterlife...

Task 4

What sort of data model and data management system would make tasks like these easy? Optionally, take a look at EMU-SDMS and think about what would be involved in using it for such things.
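
As one concrete point of comparison, imagine all of the segments loaded into a single relational table. A sketch using Python's built-in sqlite3, with an invented schema:

#!/usr/bin/env python3
# One possible data model: a single table of segments, queryable with SQL.
# The schema here is invented for illustration, not a recommendation.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE segments
              (speaker TEXT, sentence TEXT, tier TEXT,  -- 'phone' or 'word'
               start_s REAL, end_s REAL, label TEXT)""")

# After loading the .phn and .wrd files into it, questions like
# "which phones fall inside which word's time span?" become one query:
rows = db.execute("""SELECT p.label, w.label
                       FROM segments p JOIN segments w
                         ON p.speaker = w.speaker AND p.sentence = w.sentence
                        AND p.tier = 'phone' AND w.tier = 'word'
                        AND p.start_s >= w.start_s AND p.end_s <= w.end_s""").fetchall()

Compare that with the gymnastics of Task 2, and then consider what a system like EMU-SDMS adds beyond a flat table.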