LING521

Selecting and Inspecting a Sample of Examples

(due _____)

Background

It's often helpful to look at (and listen to) a random sample of examples of some phenomenon. And since there are many large speech datasets Out There, it's in principle fairly easy to do this.

Unfortunately, despite some interesting attempts, there's still no generally adopted format for such datasets, and no generally available interactive method for accomplishing the desired selection and inspection. So for now, you need to program your own way through the problem. The goal of this exercise is to give you an idea of how to do this in one particular case, in the hope that you will be able to adapt the ideas to other cases of interest to you.

Note that these other cases are likely to be different in various ways -- the dataset layout and file formats, the basis for your selections, and perhaps the nature of your inspections. But the general approach will be the same:

  1. Find all the examples of the phenomenon of interest in the dataset;
  2. Make a random selection of appropriate size;
  3. Create (and run) a script for inspecting (and perhaps classifying) the selection.
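
Step 2 is easy to get subtly wrong, for example by taking the first N hits rather than a random subset. Here is a minimal Python sketch, assuming that step 1 has already produced a list of candidates. The tuple layout, the function name select_sample, and the seed value are illustrative assumptions, not taken from the course script.

```python
import random


def select_sample(candidates, k, seed=521):
    """Return up to k candidates chosen uniformly at random, without replacement."""
    rng = random.Random(seed)      # fixed seed, so the same selection can be regenerated
    k = min(k, len(candidates))    # never ask for more items than step 1 found
    return rng.sample(candidates, k)


# Hypothetical output of step 1: (utterance ID, start time in s, end time in s)
candidates = [("utt%03d" % i, 0.5 * i, 0.5 * i + 0.4) for i in range(200)]

sample = select_sample(candidates, 20)
for utt_id, start, end in sample:
    print(f"{utt_id}\t{start:.2f}\t{end:.2f}")
```

Fixing the seed means that you (or someone checking your work) can reproduce exactly the same sample later, as long as the candidate list is unchanged.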

It'll often be the case that the first step is imperfect -- for example, if all that you have is a time-aligned orthographic transcription, finding phonetically and/or morpho-syntactically characterized examples will require some ingenuity and will not have perfect results. And the phone-level description from today's forced-alignment systems generally tries to apply dictionary-derived segment sequences whose phonetic correspondence with the speech stream is variable at best.

But in general this is OK -- even if the result of your search is ore of relatively low quality, your inspection in step 3 can still extract a pure sample. (Though you should be careful of possible bias introduced in step 1...)

Example

We'll explore the realization of #ONSET1Vnt#ONSET2 sequences in the LibriSpeech corpus, where ONSET1 is non-nasal and ONSET2 starts with a stop.
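
To give a feel for what the search in step 1 might involve, here is a rough Python sketch that looks for candidate word pairs in a line of orthographic transcript. It assumes the LibriSpeech transcript layout (an utterance ID followed by upper-case words) and uses spelling as a stand-in for pronunciation. The regular expressions, the function name find_pairs, and the example line are all illustrative assumptions, not the logic of the course script, and they will produce both false hits and misses.

```python
import re

# Word 1: does not start with M or N (a crude test for a non-nasal onset)
# and ends in a vowel letter + NT, allowing an apostrophe as in DON'T.
WORD1 = re.compile(r"^(?![MN]).*[AEIOU]N'?T$")

# Word 2: starts with a letter that usually spells a stop, skipping
# PH, TH, CH, soft C (CE, CI, CY) and KN.
WORD2 = re.compile(r"^(?:P(?!H)|T(?!H)|C(?![EIYH])|K(?!N)|[BDG])")


def find_pairs(transcript_line):
    """Yield (utterance_id, word_index, word1, word2) for each candidate pair."""
    parts = transcript_line.split()
    if len(parts) < 3:             # need an ID and at least two words
        return
    utt_id, words = parts[0], parts[1:]
    for i in range(len(words) - 1):
        w1, w2 = words[i], words[i + 1]
        if WORD1.match(w1) and WORD2.match(w2):
            yield utt_id, i, w1, w2


# Invented example line (not from the corpus)
line = ("0000-0000-0001 I WENT DOWN TO THE MINT BUT WE DON'T THINK "
        "THEY WANT TO PAINT CHAIRS")
hits = list(find_pairs(line))
for hit in hits:
    print(hit)
```

On this invented line the sketch finds WENT DOWN and WANT TO, and rejects MINT BUT (nasal onset), DON'T THINK (TH is not a stop) and PAINT CHAIRS (CH is not a stop). Finding the corresponding times in the audio would additionally require word-level alignments.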

  1. On Harris, create a subdirectory SelectionTest in your home directory, and download this script there. You could do this via e.g.
    wget http://ling.upenn.edu/courses/ling521/CheckCVntC1.sh
  2. Run that script, e.g. via
    sh CheckCVntC1.sh
  3. Copy the whole directory SelectionTest from Harris to your own machine, e.g. via
    scp -r harris.sas.upenn.edu:/home/YOURID/SelectionTest .
  4. Delete the contents of your SelectionTest directory on Harris to avoid filling up the disk.
  5. On your local machine, read the scripts (or at least the first one) into Praat and take notes on what you find.

Now choose some other phenomenon of interest to you, and modify the script (or write a new one) to explore it.
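
If you write a new one, the inspection script can itself be generated by a program. Here is a hedged Python sketch that writes a simple Praat script for stepping through a selection. It is not a copy of what the course script produces: the function name praat_script, the item layout (audio path, start time, end time), the padding value, and the file paths are assumptions, and the generated commands have not been checked against any particular Praat version.

```python
def praat_script(items, pad=0.25):
    """Build Praat script text that opens each item in turn and waits for the user.

    items: list of (audio_path, start_s, end_s); pad: context in seconds on each side.
    """
    lines = []
    for n, (path, start, end) in enumerate(items, 1):
        t0 = max(0.0, start - pad)   # do not start before the beginning of the file
        t1 = end + pad               # assumes Praat tolerates a window that overruns the file end
        lines += [
            f'sound = Read from file: "{path}"',
            f'part = Extract part: {t0:.3f}, {t1:.3f}, "rectangular", 1, "yes"',
            "View & Edit",
            f'pauseScript: "Item {n} of {len(items)}. Click Continue for the next one."',
            "removeObject: sound, part",
        ]
    return "\n".join(lines) + "\n"


# Hypothetical selection: (path, start time in s, end time in s)
selection = [
    ("audio/example1.flac", 1.00, 1.50),
    ("audio/example2.flac", 0.10, 0.62),
]
script = praat_script(selection)
print(script)
```

Saving the printed text to a file and opening it in Praat's script editor should present the excerpts one at a time; you would take notes (or record a classification) at each pause.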