LING521 - Spring 2017

Exercise #2 -- downloading and processing an audio book. Due Monday 2/20/2017. Turn in (on the course Canvas site) a set of informal notes on the exercise, including at least links to the texts and audio you chose, and a list of the resulting (corresponding) text and audio files.

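You're going to download the text and the audio of a LibriVox audiobook, and get both into a form suitable for forced alignment.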

In the next exercise, you'll learn to make a Praat script to annotate some features of interest.

The instructions below are fairly detailed. Feel free to do things differently if you know how, or want to learn. Ask someone (Google, a fellow student, me, ...) if you get stuck.

1. Log in to harris.sas.upenn.edu.

2. On your laptop, check out librivox.org and choose an audiobook recording to process. As an illustration, I'm going to process a reading of Joseph Conrad's Heart of Darkness. Make your own choice as you please, and translate the following for-instance instructions appropriately.

3. First, download the text. The LibriVox page for the book has a section "Links" which includes a link to "Online text":


In this case, the "Online text" link takes you to a page at Project Gutenberg, which offers a choice of six formats.

We want the "Plain Text UTF-8" version -- and

right-click>>Copy Link Address    [on Windows]

or

control-click>>Copy Link Address   [on OS-X or Linux]

tells us that the link in this case is

http://www.gutenberg.org/files/219/219-0.txt

On harris, make a directory (under your home directory) with an appropriate name, change into it, and fetch the file there, with something like

$ mkdir HeartOfDarkness    # make a new directory
$ cd HeartOfDarkness       # change directory to the new directory
                           # now use 'wget' to fetch the file
$ wget http://www.gutenberg.org/files/219/219-0.txt

[Note that the dollar signs are computer prompts -- don't type them! Your prompts will probably be more complicated...]

The computer will print back something like this:

--2017-02-14 19:39:47--  http://www.gutenberg.org/files/219/219-0.txt
Resolving www.gutenberg.org (www.gutenberg.org)... 2610:28:3090:3000:0:bad:cafe:47, 152.19.134.47
Connecting to www.gutenberg.org (www.gutenberg.org)|2610:28:3090:3000:0:bad:cafe:47|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 236398 (231K) [text/plain]
Saving to: ‘219-0.txt’

219-0.txt           100%[=====================>] 230.86K --.-KB/s   in 0.06s

2017-02-14 19:39:47 (3.84 MB/s) - ‘219-0.txt’ saved [236398/236398]

which is a complicated way of saying 'OK, boss.' The results are in an obscurely named file -- here 219-0.txt -- which also has a format problem:

$ file 219-0.txt
219-0.txt: UTF-8 Unicode (with BOM) text, with CRLF line terminators

The BOM = "byte order mark" is fine. But the CRLF (= "Carriage Return + Line Feed") line terminators are an old DOS thing that may confuse and annoy some unix programs. So we do this:

$ dos2unix 219-0.txt                   # convert the file in place
dos2unix: converting file 219-0.txt to Unix format ...
$ file 219-0.txt
219-0.txt: UTF-8 Unicode text
$ mv 219-0.txt HeartOfDarkness.txt     # rename file
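
If you're curious what that conversion amounts to, here is a minimal sketch that doesn't need dos2unix at all. It builds a tiny sample file in a scratch directory (the names sample.txt and sample_unix.txt are made up for illustration), counts the lines that end in a carriage return, and strips the carriage returns with tr. Note that tr removes every carriage return, not just those before a line feed, and leaves the BOM alone, so this only approximates what dos2unix does. No prompts are shown, so the lines can be pasted as they are.

```shell
tmp=$(mktemp -d)                      # scratch directory for the sample files
cr=$(printf '\r')                     # a literal carriage-return character

# A two-line sample with a UTF-8 BOM and CRLF line terminators.
printf '\357\273\277line one\r\nline two\r\n' > "$tmp/sample.txt"

# Count the lines that end in a carriage return (prints 2 here).
grep -c "$cr\$" "$tmp/sample.txt"

# Delete the carriage returns and write the result to a new file.
tr -d '\r' < "$tmp/sample.txt" > "$tmp/sample_unix.txt"

# Count again (prints 0; grep exits nonzero when nothing matches).
grep -c "$cr\$" "$tmp/sample_unix.txt" || true
```

The same grep count, run on your own downloaded text before and after dos2unix, is a quick way to confirm that the conversion did what you expected.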

4. Now we download the audio. There are several possible ways to do this -- the easiest one is probably the following.

First, let's note that LibriVox makes audio available in two formats:

We want the higher quality versions. Looking again at the "Links" section on the LibriVox page for the book, we see two ways to get at those versions: one is through the "Internet Archive Page" and the other is through the "Download M4B (119MB)" link.

If you go with the "Download M4B" option, you'll need to use something like /usr/local/bin/unm4b to unpack the segments. But we're going to go the "Internet Archive Page" route -- and clicking there, we get to a page with a panel of "Download Options":

If we click on the first line "128KBPS MP3", we get:

And if we click on the "6 files" button, it will offer to download a zip file containing the six files. You could download them to your laptop or other interactive machine, and then copy them to harris. But in order to download directly to harris, we can use the Copy Link Address method to get the URL, and then in our harris terminal window we can do:

$ wget -O X.zip 'https://archive.org/compress/heart_of_darkness/formats=128KBPS%20MP3&file=/heart_of_darkness.zip'

After the computer tells us complicated things about its prospects and progress, we can unzip X.zip (and then remove it to reclaim the file space...):

$ unzip X.zip
Archive:  X.zip
 extracting: heart_of_darkness_1b_conrad.mp3  
 extracting: heart_of_darkness_1a_conrad.mp3  
 extracting: heart_of_darkness_3a_conrad.mp3  
 extracting: heart_of_darkness_2a_conrad.mp3  
 extracting: heart_of_darkness_2b_conrad.mp3  
 extracting: heart_of_darkness_3b_conrad.mp3  
$ rm X.zip
$
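
One detail of the wget command above deserves a note: the archive.org URL contains an '&'. Unquoted, the shell reads '&' as "run what comes before in the background", so the URL needs to go inside single quotes. Here is a small sketch that puts the URL in a variable (the variable name url is my own choice) and prints the command rather than running it, so you can check it first.

```shell
# Single quotes keep the shell from interpreting the '&' in the URL.
url='https://archive.org/compress/heart_of_darkness/formats=128KBPS%20MP3&file=/heart_of_darkness.zip'

# Dry run: print the wget command that would be executed.
printf "wget -O X.zip '%s'\n" "$url"

# The last path component of the URL (prints heart_of_darkness.zip).
printf '%s\n' "${url##*/}"
```

To do the download for real, run wget -O X.zip "$url" with the double quotes around the variable.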

Now we can use soxi to check the format of these files, e.g.

$ soxi heart_of_darkness_1a_conrad.mp3

Input File     : 'heart_of_darkness_1a_conrad.mp3'
Channels       : 1
Sample Rate    : 44100
Precision      : 16-bit
Duration       : 00:44:10.42 = 116883654 samples = 198782 CDDA sectors
File Size      : 42.4M
Bit Rate       : 128k
Sample Encoding: MPEG audio (layer I, II or III)
Comments       : 
Title=Chapter 1 Part 1
Artist=Joseph Conrad
Album=Heart of Darkness
Tracknumber=1
Year=2006
Genre=101

We want to turn these .mp3 files at 44.1 kHz sampling rate to .wav files at 16 kHz sampling rate, so we do something like this:

$ for section in 1a 1b 2a 2b 3a 3b
> do
> sox heart_of_darkness_"$section"_conrad.mp3 HeartOfDarkness"$section".wav rate 16000
> done
$

[Note that the '>' characters are prompts -- the computer telling you that you're in the middle of a shell loop -- and you shouldn't type them...]

You'll now have six .wav files -- and you can delete the .mp3 files.

$ ls *.wav
HeartOfDarkness1a.wav  HeartOfDarkness2a.wav  HeartOfDarkness3a.wav
HeartOfDarkness1b.wav  HeartOfDarkness2b.wav  HeartOfDarkness3b.wav
$ rm *.mp3
$
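
The conversion loop above lists the six section labels by hand. For a book with many sections, you might prefer to let the shell find the .mp3 files and derive the .wav names from them. The following is a sketch under the assumption that your files follow the same heart_of_darkness_XX_conrad.mp3 pattern; the function name wav_name is made up, and you'd adjust the prefix and suffix for your own book. It only prints the sox commands (a dry run), so nothing is converted until you remove the echo.

```shell
# Derive the .wav name for one .mp3 name (pattern assumed from this example).
wav_name() {
    section=${1#heart_of_darkness_}   # drop the leading book name
    section=${section%_conrad.mp3}    # drop the trailing author and extension
    printf 'HeartOfDarkness%s.wav\n' "$section"
}

for mp3 in heart_of_darkness_*_conrad.mp3
do
    [ -e "$mp3" ] || continue         # skip if the pattern matched no files
    # Dry run: print the sox command instead of executing it.
    echo sox "$mp3" "$(wav_name "$mp3")" rate 16000
done
```

Check the printed commands against what you expect, and only then delete the word echo and run the loop again (before removing the .mp3 files, of course).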

5. OK, now the native unix guides and bearers are abandoning you to return to their digital villages, and you're left in the digital jungle to solve the remaining problems by yourself -- above all, dividing the text into pieces that correspond to the audio files.

Once this is all done, you'll be able to run the forced aligner on each of the text/audio file pairs as you learned to do last week. In the next exercise, we'll go on to explore how to annotate an aligned audiobook (or a set of aligned audiobooks) for some interesting phonetic characteristic(s).

NOTE: There are various possible character-set issues, in general and specifically with Gutenberg UTF-8 texts.

Problem 1:  At the moment, segment.py assumes the old-fashioned idea that double and single quotation marks are simply:

U+0022 Quotation Mark "
U+0027 Apostrophe '

And the assumption is also that Apostrophe is Apostrophe.

But the Gutenberg texts often use these for single and double quotation marks:

U+2018 Left Single Quotation Mark ‘
U+2019 Right Single Quotation Mark ’
U+201C Left Double Quotation Mark “
U+201D Right Double Quotation Mark ”

And the Right Single Quotation Mark is often used for Apostrophe.

Those usages will leave segment.py deeply confused.

There are some other problems, such as -- or — ("em dash") written solid with the text to the left and right, and the use of flanking _ characters to indicate italics.

To fix all of these, run a command like

$ fixgutenberg xyz.txt >Nxyz.txt

before doing the forced alignment.
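
fixgutenberg is a command installed on harris, and its internals aren't shown here. To give an idea of the kind of cleanup involved, the following is a rough sketch with sed. It is not the fixgutenberg code, and the function name normalize_gutenberg is made up. It maps the curly quotation marks to their plain counterparts, turns em dashes and solid -- into ' -- ' with a space on each side, and removes paired underscores that open and close on the same line.

```shell
# Read text on standard input, write the cleaned-up text to standard output.
# The sed expressions, in order:
#   1-2. left/right single quotation marks -> apostrophe (U+0027)
#   3-4. left/right double quotation marks -> quotation mark (U+0022)
#   5.   em dash -> two hyphens
#   6.   two hyphens, with or without spaces around them -> ' -- '
#   7.   _italic text_ -> italic text (only when both underscores are on one line)
normalize_gutenberg() {
    sed -e "s/‘/'/g" \
        -e "s/’/'/g" \
        -e 's/“/"/g' \
        -e 's/”/"/g' \
        -e 's/—/--/g' \
        -e 's/ *-- */ -- /g' \
        -e 's/_\([^_]*\)_/\1/g'
}

# A one-line example; prints:  "Don't," he said -- softly.
printf '%s\n' '“Don’t,” he said—_softly_.' | normalize_gutenberg
```

On a whole file the usage would be normalize_gutenberg <xyz.txt >Nxyz.txt, but for the exercise itself fixgutenberg is the tool to use.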