Praat Scripts for Rapid Annotation

The situation:

A solution:

Here's a simple TIMIT illustration.

One of the two TIMIT "calibration" sentences is "Don't ask me to carry an oily rag like that". We want to evaluate the curious idea that the /ae/ in "ask" is systematically fronter than the /ae/ in "that", and we're going to rely on F2 estimates to do so.

Getting the relevant filenames and times. The goal is to get a simple list of triples

FILENAME STARTTIME ENDTIME

such that each triple represents one of the places where (we think) our annotation should be done. We might end up working through the whole list; we might select a sample to use as training material; or we might annotate only a sample because that gives us enough evidence for our scientific purposes.

This part of the process will be different in each case, depending on the structure of our dataset and what we want to annotate. So we won't focus on the details of our silly illustrative example, but just cut quickly to the result.

The version of TIMIT in harris.sas.upenn.edu:/plab/NewTimit has had its directory structure flattened out, so that all of the audio files are in one subdirectory wavs, and all of the phone files are in one subdirectory phones. In addition, we've created TextGrid files in the subdirectory textgrids.

The sentence we're interested in is called "SA2", and a few lines of shell scripting will produce

We'll do things in that order (choosing the subset of files and then extracting the triples) because we want to annotate both vowels in each file.

The result is a list of 40 lines like this:

MHJB0_SA2 0.378 0.509
MHJB0_SA2 1.837 2.092
MPRD0_SA2 0.468 0.632
MPRD0_SA2 2.211 2.423
MWSB0_SA2 0.345 0.486
MWSB0_SA2 2.087 2.236
MTBC0_SA2 0.355 0.487
MTBC0_SA2 2.252 2.418
...
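
As a rough illustration of how triples like these could be extracted, here is a minimal sketch. It is not the script that was actually used. It assumes that each phone file uses the standard TIMIT layout (lines of STARTSAMPLE ENDSAMPLE PHONE), that the audio is sampled at 16 kHz, and that the vowel is labeled "ae"; the function name ae_triples and the file-naming pattern in the usage comment are made up for the example. Note that it picks up every "ae" segment in a file, so any other word transcribed with that vowel would also appear in the list.

```shell
# Hypothetical helper: print "FILEBASE STARTTIME ENDTIME" for every "ae"
# segment in one TIMIT-style phone file.
# Assumption: times are sample indices at 16 kHz, so index / 16000 = seconds.
ae_triples() {
    phonefile=$1
    base=$(basename "$phonefile")
    base=${base%.*}          # strip the extension, whatever it is
    awk -v base="$base" -v rate=16000 '
        $3 == "ae" { printf "%s %.3f %.3f\n", base, $1 / rate, $2 / rate }
    ' "$phonefile"
}

# Possible use over a directory of phone files (path and naming are assumptions):
# for f in "$DIR"/phones/*_SA2.*; do ae_triples "$f"; done > SampleAElist1.txt
```

If the phone files in a given copy of the dataset already store times in seconds, the division by the sample rate should simply be dropped.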

I've put the list in a file SampleAElist1.txt, and now we can execute /usr/local/bin/seq2script1 to create our Praat script. If we call the program without any command-line arguments, it gives us a hint about usage:

$ seq2script1
USAGE: seq2script textgridDIR audioDIR audioEXT CASEID
    From STDIN -- lines of the form
        FILEBASE STARTTIME ENDTIME

The command-line arguments are:

textgridDIR -- where the textgrids can be found
audioDIR -- where the audio files can be found
audioEXT -- what kind of audio files they are, e.g. wav, flac, mp3
CASEID -- arbitrary identifier to let you keep track of the script files, notes file, etc.

Let's try it:

$ DIR=/plab/NewTimit
$ cat SampleAElist1.txt | seq2script1 $DIR/textgrids $DIR/wavs wav SampleAElist1
script in  SampleAElist1.praat  -- you can add notes to  SampleAElist1.praatnotes

I've set up the basic dataset directory DIR as a shell variable, because of course you'll need to run this on an interactive machine rather than on harris, and the home of the NewTimit dataset may be different there.

Depending on the case, you might want to save corrected timespans, or formant or f0 values, or classifications -- we would need to modify the seq2script1 program in order to set that up. This requires both some knowledge of Praat scripting and some knowledge of the "little language" that the seq2script program is written in, namely gawk. If you look at the program and at the scripts it creates, you'll see where you need to intervene.
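
To give a rough idea of what a generator of this kind does, here is a minimal sketch of an awk program that turns each triple into a few lines of Praat script. It is not the real seq2script1, and the Praat commands it emits are only one plausible way to open a Sound and TextGrid, zoom to the region, and pause; the function name triples_to_praat and the wording of the pause message are invented for the example. For simplicity it re-reads the files for every triple, even when two triples come from the same file.

```shell
# Hypothetical sketch, NOT the real seq2script1: read
# "FILEBASE STARTTIME ENDTIME" lines on stdin and write Praat script
# lines on stdout. Arguments: textgridDIR audioDIR audioEXT.
triples_to_praat() {
    tgdir=$1; audiodir=$2; ext=$3
    awk -v tg="$tgdir" -v au="$audiodir" -v ext="$ext" '
        NF == 3 {
            base = $1
            # Load the audio and its TextGrid
            printf "Read from file: \"%s/%s.%s\"\n", au, base, ext
            printf "Read from file: \"%s/%s.TextGrid\"\n", tg, base
            # Open both together in an editor window
            printf "selectObject: \"Sound %s\"\n", base
            printf "plusObject: \"TextGrid %s\"\n", base
            print  "View & Edit"
            printf "editor: \"TextGrid %s\"\n", base
            # Zoom to the region of interest and wait for the annotator
            printf "    Zoom: %s, %s\n", $2, $3
            printf "    pauseScript: \"%s %s-%s: inspect, then continue\"\n", base, $2, $3
            print  "endeditor"
            # Clean up before the next triple
            printf "removeObject: \"Sound %s\", \"TextGrid %s\"\n", base, base
        }
    '
}

# Possible use:
# triples_to_praat $DIR/textgrids $DIR/wavs wav < SampleAElist1.txt > sketch.praat
```

Saving measurements or corrected labels would mean adding further Praat commands inside the editor block, which is the kind of intervention described above.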

Saving information from the script.  In what we've done so far, the only way to make a record of your measurements and annotations is by adding text to the .praatnotes file. Here are a few ways to save information automatically.

One approach is to write to the "Info window" or to write to an external file. In /usr/local/bin/seq2script1a, we've arranged for the output Praat script to write F1, F2 & F3 values to an external file.

$ DIR=/plab/NewTimit
$ cat SampleAElist1.txt | seq2script1a $DIR/textgrids $DIR/wavs wav SampleAElist1
script in  SampleAElist1.praat  -- you can add notes to  SampleAElist1.praatnotes
   Output in  SampleAElist1.out

Another possibility is to save time information during the interactive inspection, and then to use that information later in a different program to extract and process relevant phonetic parameters. This is often preferable, since the Praat scripting language is so problematic. One general approach is to

So for example:

# Make local directory for annotation-enhanced TextGrids
mkdir annotation
#######################
# $SAMPLE is our list of FILEBASE STARTTIME ENDTIME triples
SAMPLE=SampleAElist1.txt
########################
# $DIR is wherever the audio and original TextGrid files are
DIR=/plab/timit1
#######################
# Now make the annotation-enhanced TextGrids
gawk '{print $1}' $SAMPLE | sort -u >filelist
for f in `cat filelist`
do
   AddTier $DIR/textgrids/$f.TextGrid > annotation/$f.TextGrid
done
######################
# And now make the praat script and corresponding notes file
#     in the normal way
cat SampleAElist1.txt | seq2script1 annotation $DIR/wavs wav SampleAElist1

Now there's a Praat script in SampleAElist1.praat, and after we've run it, we can run e.g.

$ tg_annotation2lab annotation/MHJB0_SA2_a.TextGrid
0.000 0.378 ""
0.378 0.508 "yes"
0.508 1.838 ""
1.838 2.090 "yes"
2.090 2.220 ""
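
Output in this form is easy to pass along to other programs. As a minimal sketch, assuming the three-column layout shown above (start time, end time, quoted label), the following function keeps only the intervals labeled "yes" and prints their start and end times; the name yes_spans is made up for the example.

```shell
# Hypothetical helper: from lines of the form  START END "LABEL",
# keep only the intervals whose label is "yes" and print START END.
# Reads from the named files, or from stdin if none are given.
yes_spans() {
    awk '$3 == "\"yes\"" { print $1, $2 }' "$@"
}

# Possible use:
# tg_annotation2lab annotation/MHJB0_SA2_a.TextGrid | yes_spans
```

The resulting time spans can then be handed to whatever program does the formant or f0 measurement, outside of Praat's scripting language.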