Automatic speech segmentation with HTK

Kyle Gorman
Department of Linguistics
University of Pennsylvania
Institute for Research in Cognitive Science
kgorman@ling.upenn.edu

Soon to be made obsolete: check back soon for details

Sections

Getting started: Corpora Software Prerequisites Staying organized
Bootstrapping the model: Creating MFCCs Initializing the model Re-estimation
Lower-level models: Fixing sil models Training Re-aligning data More training
Higher-level models: Going multi-Gaussian Going speaker-dependent Segmenting
Acknowledgements References

Getting Started

Despite years of skepticism, the increasing power of desktop computers and the greater availability of speech corpora have made corpus-based phonetic analysis an important method of linguistic inquiry. To perform this sort of analysis, it is often necessary to have word- or phone-level transcriptions time-aligned with audio data. For instance, if you were interested in the f2 properties of a diphthong in a dialect of English and wished to extract them automatically from a corpus, you would need a list of the places where the diphthong occurred, with start and end times for each.

START STOP PHONE
10.203 10.560 EY
18.902 19.234 EY
43.023 43.430 EY
54.323 54.840 EY
... ... ...

Above is a sample text file that could be obtained from a phone-aligned speech transcript using a tool like grep. It lists the onset and offset of each token of a certain phone, labeled in ARPABET; this would be useful input to a script for Praat or some other software for phonetic analysis. However, not all corpora have this sort of segmentation. Even corpora purchased from other sources are often not segmented, and recordings you make yourself rarely are. Segmentation by hand is exceedingly tedious, and inter-annotator agreement is poor at best. Since the declassification of various algorithms and the application of Shannon's information theory to human language, however, it has become possible to induce these segmentations and alignments automatically. Modern speech corpora are in fact not only segmented but also transcribed in this manner.

This method, along with methods for the related problem of building speech recognition systems, is detailed carefully in the HTK Book, a lengthy but invaluable resource for speech recognition. However, this tutorial is dedicated to another use of speech recognition algorithms: the time alignment and segmentation of audio data for which word-level transcriptions exist.

Corpora

There are two situations in which you might find yourself in possession of recordings and transcripts, but not time alignments. The first is with professionally produced speech corpora of the sort available from the Linguistic Data Consortium (expensive, so usually obtained through an academic institution's license) or from a source like CHILDES (a free online source of child and child-directed speech). The second is with transcribed recordings from speech interviews, perhaps from sociolinguistic research. These are often tediously hand-coded, but automatic segmentation can be performed with enough data.

For this example, the data comes from professional sources. I'm trying to study the acquisition of intonation in children. It is my hope to automatically extract f0 (vocal pitch) information from time-aligned transcripts. For this, I will use a portion of the CHILDES corpus that consists of infant-directed speech. It has high-quality audio and utterance-level transcripts, but lacks time alignments or phone segmentation.

It's known that child-directed speech is acoustically different from other domains of speech, not to mention the limited, often novel vocabulary and the simplicity of the structure; therefore, I'm building a comparative database of adult-directed speech. To match the task, I chose recordings from the ATIS corpus. ATIS data was recorded in a so-called "Wizard of Oz" simulation: participants thought they were talking to a computer, though the computer was "played" by a confederate. This was done to obtain the speech recordings necessary to build an automatic speech recognizer that could perform this very task. Speakers were told the "computer" was a system for automatically booking flights around America, and were asked to book flights. This is a similar task to the child-directed speech act: the domain is narrow, the lexical items are somewhat novel (toponyms), and the hearer is perceived as not being fully sentient.

An older version of this tutorial used data from ATIS2 and ATIS3 for which there are both audio and sentence transcripts. I'll also make use of some data from CHILDES, for which higher-level models will be used.

Software

The software core for speech segmentation comes from HTK, the Hidden Markov Model Toolkit, originally developed at the Cambridge University Engineering Department, later owned by Microsoft, and now once again maintained and distributed by Cambridge. The package contains a series of sophisticated algorithms for building and decoding hidden Markov models. To obtain HTK, you must register with a username and password, download the source code, and compile it as instructed by the package's documentation. Like anything in this field, it's designed to work in a UNIX-like environment (BSD, Solaris, Linux, Mac OS X), but perhaps because of the Microsoft connection, it's also compatible with MS Windows.

However, actually operating HTK requires a variety of text manipulation. This sort of operation I perform with Perl and a variety of shell tools like grep and sed, but these operations can also be performed by a more-expansive tool like Python or a combination of shell scripting and legacy tools like awk. It's up to you, but I've created a package of extensible Perl tools for use in this process. It's available here [tar.gz], released under a BSD-style license. For users on systems which don't include Perl by default (that is, Windows), you will need to obtain an environment like ActivePerl. I don't, however, vouch for my scripts' abilities to work on Windows.

Prerequisites

You need the following to perform speech recognition:

Staying organized

This is the most important part of all, honestly. There are a few important axioms to keep in mind.

The directory structure I suggest is to keep scripts, lists, and configuration files in one directory (on UNIX-alikes, probably under a /scratch or /scr directory). Then, inside that, have a folder for raw data (audio and transcripts, properly segmented), which I labeled data/. You also need another folder for your cepstral coefficients, mfc/, and ones for your HMM definitions (hmm00/, hmm01/, and so on). Lastly, a directory for your segmentations, seg/, is useful.
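The layout just described can be created in one go. What follows is a minimal Python sketch and not part of the seg package: the function name make_layout is hypothetical, mfc/ is the directory name the later steps use for label files, and the default of ten model directories (hmm00/ through hmm09/) simply matches the ones used later in this tutorial.

```python
import tempfile
from pathlib import Path


def make_layout(root, n_models=10):
    """Create data/, mfc/, seg/ and hmm00/, hmm01/, ... under root."""
    names = ["data", "mfc", "seg"]
    names += ["hmm%02d" % i for i in range(n_models)]
    for name in names:
        # exist_ok=True makes it safe to run the function twice
        (Path(root) / name).mkdir(parents=True, exist_ok=True)
    return names


# Demonstration in a throwaway directory, so nothing real is touched
with tempfile.TemporaryDirectory() as tmp:
    print(make_layout(tmp, n_models=2))
```

To set up a real task directory, call make_layout(".") from inside it instead.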

The HTK documentation suggests a few common file extensions (mlf, scp, etc.), but for other files it leaves them bare. I find this extremely confusing, so where possible I try to use UNIX-style file extensions.

Another useful tip is to keep a batch file which runs all the commands. I provide one called compile, which comes with the commands commented out. To run a command, simply delete the comment marker in front of it.

One last comment, on permissions. When you are working on a UNIX-like system, some scripts may not work and will exit with a Permission denied error. This usually indicates that you, the user, aren't authorized to run the script as it is. If the script is called script, issue the following chmod command to fix this (+x means "give permission to execute the file"):

[kgorman@harris atis]$ chmod +x script

Bootstrapping the model

Before we actually begin, create a directory for the task. Inside that, create a directory data/, and put sentence-length raw audio files (.wav files) and sentence-level transcripts (.lab files) in it. Each .lab file should contain one sentence on a single line, with the words separated by spaces.

[kgorman@harris atis]$ cat data/8k3011ss.lab
FIND ME A FLIGHT THAT FLIES FROM MEMPHIS TO TACOMA

8k3011ss.lab contains the sentence, and 8k3011ss.wav contains the matching audio. If your data is scattered in a complex directory structure, use UNIX find with the pattern-matching -name (or case-insensitive -iname) flag to get all the file names.

Obtain the scripts and configuration files. Untar/Unzip them.

[kgorman@harris atis]$ wget http://ling.upenn.edu/~kgorman/papers/seg.tar.gz
[kgorman@harris atis]$ tar -xf seg.tar.gz

This is the sort of tarball which explodes in your working directory, so execute this in the directory where you want all your config files, not the one for your data files or your home directory.

Obtain the CMU pronunciation dictionary. As mentioned above, you may not wish to do this since an optimized version is included in the seg package.

[kgorman@harris atis]$ wget ftp://ftp.cs.cmu.edu/afs/cs.cmu.edu/data/anonftp/project/fgdata/dict/cmudict.0.6d

Create a wordlist. If your sentence-level files are in data/ and labeled with .lab extensions, issue the following command:

[kgorman@harris atis]$ cat data/*.lab | ./wordList | sort | uniq > word.list
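For readers who would rather not use the Perl script, the whole pipeline above (split into words, sort, remove duplicates) can be sketched in Python. This is an illustration of the idea, not a port of wordList, whose source is not shown here; the function name word_list and the second sample sentence are made up for the example.

```python
def word_list(lines):
    """Return the sorted, de-duplicated words found in transcript lines."""
    words = set()
    for line in lines:
        words.update(line.split())  # words are separated by whitespace
    # sorted() orders strings by code point, much like sort in the C locale
    return sorted(words)


sentences = [
    "FIND ME A FLIGHT THAT FLIES FROM MEMPHIS TO TACOMA",
    "SHOW ME A FLIGHT TO MEMPHIS",  # invented second sentence
]
for word in word_list(sentences):
    print(word)
```

In real use, the lines would come from reading every .lab file in data/, and the result would be written to word.list, one word per line.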

Create a dictionary for the task. Fortunately, HTK provides a tool for this, HDMan. It also outputs all the phones used, in the same command.

[kgorman@harris atis]$ HDMan -m -w word.list -n phone1.list -l dlog dict.list cmudict.0.6

Two notes at this point. First off, the HTK documentation uses file names like names.txt or whatever. I think it makes more sense to call that file name.txt, since while it's a list of names, it's a namefile. Secondly, when I executed the command above, HDMan output a few error messages indicating that the dictionary was out of order. Using sort on the dictionary didn't resolve the problem, so it must be something more idiosyncratic that HTK is wanting. If this is a problem, just use the included cmudict.

The HDMan command above looks up each word in the wordlist, puts all the necessary pronunciations in dict.list, and puts a list of the phones used in phone1.list. HTK also inserts a phone symbol sp, short for "small pause", at the end of each word. This is standard practice.

This next step is a bit idiosyncratic. We want to create a list of phones which lacks sp, since it's difficult to bootstrap this model. Instead, we'll build sp later from another model (once it's constructed), the model for the phone sil, which we'll insert at the beginning and end of every file. So, phone1.list has both sil and sp, and phone0.list lacks sp. You can create phone0.list using a simple grep command. -v puts grep in a bizarro world where true is false, so any line that doesn't match 'sp' will be printed.

[kgorman@harris atis]$ grep -v 'sp' phone1.list > phone0.list

You also need to add sil to both files. This is simple.

[kgorman@harris atis]$ echo sil >> phone1.list
[kgorman@harris atis]$ echo sil >> phone0.list

Now, you need to put your orthographic transcriptions into the HTK label format, called a master label file (MLF). The HTK tutorial provides a script to do this (prompts2mlf, which will be in the HTKTutorial/ directory from installing HTK), but it assumes that your label files have the file name as their first field. If this is not the case, as it is not for the ATIS data, you can use my script, label2mlf. All the label files go into a single MLF.

[kgorman@harris atis]$ ./label2mlf data/*.lab > word0.mlf
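The conversion itself is simple enough to sketch in Python. The following is a guess at what such a script has to do, based only on the MLF layout (a #!MLF!# header, a quoted label-file name, one word per line, and a lone period to close each entry); it is not the source of label2mlf, and the function name labels_to_mlf is hypothetical.

```python
def labels_to_mlf(transcripts):
    """Build MLF text from (label_path, sentence) pairs."""
    out = ["#!MLF!#"]
    for path, sentence in transcripts:
        out.append('"%s"' % path)      # quoted name of the label file
        out.extend(sentence.split())   # one word per line
        out.append(".")                # a lone period ends the entry
    return "\n".join(out) + "\n"


pairs = [("data/8k3011ss.lab",
          "FIND ME A FLIGHT THAT FLIES FROM MEMPHIS TO TACOMA")]
print(labels_to_mlf(pairs), end="")
```

In real use, the pairs would be built by reading each .lab file in data/ and the result redirected to word0.mlf.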

If everything is going well, word0.mlf should look like the following (if you're using a corpus about booking flights):

#!MLF!#
"data/8k3011ss.lab"
FIND
ME
A
FLIGHT
THAT
FLIES
FROM
MEMPHIS
TO
TACOMA
.
"data/8k3012ss.lab"
I
WOULD
LIKE
TO
BOOK
A
...

...and so on. The same thing needs to be done for the phones in your transcripts. Luckily, we don't have to write a lengthy script to do this, since the label editor provided by HTK (HLEd) does this for us automatically. It takes an HTK editor script (these have the extension .led), mkPhones0.led, which is included in the configuration and script tarball above. This adds the long silence phone sil but deletes the short silence phone sp (we'll add it back in later: that's why it's mkPhones0.led and not mkPhones.led).

[kgorman@harris atis]$ HLEd -l 'data/' -d dict.list -i phone0.mlf mkPhones0.led word0.mlf

This assumes your data is in a subdirectory, data/.

Creating MFCCs

Now we are ready to create cepstra. What are cepstra (sg. cepstrum)? Well, they're like spectra, but not. A cepstrum is what you get by taking the (inverse) Fourier transform of the logarithm of a spectrum. In this case, we'll create a certain variety of cepstral coefficients, the Mel Frequency Cepstral Coefficients (MFCCs), which are the standard in speech research. Unlike plain cepstra, MFCCs are computed on the Mel scale, which is based on perceptual results in human hearing. The physiological and psychological structure of human hearing has the effect of increasing the relative perception of intensity for some frequencies, and decreasing it for others, and MFCCs take this into account. To create the cepstra, which are the raw data used to form HMMs, we use the HTK tool HCopy. Though the name might strike you as weird, it makes "copies" of data, just in a different format (in this case, MFCCs). HCopy takes a single configuration file, named copy.cfg, which is provided in the configuration/script tarball. The parameters are listed below.

SOURCEKIND = WAVEFORM
SOURCEFORMAT = WAVE
SOURCERATE = 625.0 # 16 kHz sampling rate, as a sample period in 100 ns units
TARGETKIND = MFCC_0 # MFCCs are the best choice, C_0 as an energy coefficient
TARGETRATE = 100000.0 # 10 ms targets
SAVECOMPRESSED = T # keep compressed output
SAVEWITHCRC = T # use checksums
WINDOWSIZE = 250000.0 # 25 ms window
USEHAMMING = T # use a hamming window
PREEMCOEF = 0.97 # first order preemphasis
NUMCHANS = 20 # 20 filterbank channels
CEPLIFTER = 22 # cepstral liftering coefficient (not a filter count)
NUMCEPS = 12 # make 12 MFCC cepstral coefficients
ENORMALISE = T # normalize intensity of data
NATURALREADORDER = T # Zhi-Jie Yan, p.c., should solve problems with byte order

The one you may have to change is SOURCERATE. This is the sample period in 100 ns units; the ATIS data was digitized at a 16 kHz sampling rate, so 625.0 is the right SOURCERATE. Some of these settings are the defaults; others are suggested in the HTK Book. If you're interested in what they all mean, you're advised to take a signal processing class.
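If your audio was recorded at a different rate, the value is easy to work out: one second is 10,000,000 units of 100 ns, so SOURCERATE is 10,000,000 divided by the sampling rate in Hz. A small Python sketch of that arithmetic follows; the function name source_rate is just for illustration.

```python
def source_rate(sample_rate_hz):
    """Sample period in 100 ns units for a sampling rate given in Hz."""
    return 1e7 / sample_rate_hz


for hz in (8000, 16000, 44100):
    print(hz, source_rate(hz))
```

For 16000 Hz this gives 625.0, the value used in copy.cfg above.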

If you're running HCopy on more than a few files, you'll want to create a source file. All UNIX shells have a rather limited number of arguments they can take, and even if they didn't, would you want to type all that out and wait for HCopy to load over and over again while it iterated through the data? You should create a file that contains the list of audio (.wav) files and the corresponding output (.mfc) files. This can be created by the included copyList script. The format is the filename of the .wav file, a space, then the filename of the output .mfc file, followed by a newline.

[kgorman@harris atis]$ ./copyList > copy.scp

copy.scp should look something like this (if your data isn't in data/, you will need to edit copyList):

data/8k3011ss.wav data/8k3011ss.mfc
data/8k3012ss.wav data/8k3012ss.mfc
data/8k3013ss.wav data/8k3013ss.mfc
data/8k3014ss.wav data/8k3014ss.mfc
data/8k3015ss.wav data/8k3015ss.mfc
data/8k3021ss.wav data/8k3021ss.mfc
data/8k3022ss.wav data/8k3022ss.mfc
data/8k3023ss.wav data/8k3023ss.mfc
data/8k3024ss.wav data/8k3024ss.mfc
data/8k3025ss.wav data/8k3025ss.mfc
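A list in this format is straightforward to generate yourself. Here is a minimal Python sketch of the idea; it is not the copyList script itself, and the function name copy_list is hypothetical. It assumes each .mfc file should sit next to its .wav file, as in the listing above.

```python
from pathlib import PurePosixPath


def copy_list(wav_paths):
    """Return 'input.wav output.mfc' lines, one per audio file."""
    lines = []
    for wav in sorted(wav_paths):
        # Same directory and base name, with the extension swapped
        mfc = PurePosixPath(wav).with_suffix(".mfc")
        lines.append("%s %s" % (wav, mfc))
    return lines


print("\n".join(copy_list(["data/8k3012ss.wav", "data/8k3011ss.wav"])))
```

In real use, the input would be something like glob.glob("data/*.wav"), with the output written to copy.scp.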

Now, you can execute HCopy. If you don't see any error messages after the first few seconds, it is time for your first sandwich of the day.

[kgorman@harris atis]$ HCopy -T 1 -C copy.cfg -S copy.scp

Initializing the model

The next step is to initialize the monophone HMMs. These are called "flat-start" HMMs since they just take all states to have the same mean and variance. We'll follow the suggestion given by the HTK Book and use a 3-state left-right model with a thirteen-value static vector plus the delta (change) coefficients plus the acceleration coefficients. That means 39 vector values. HTK doesn't have a tool for making the flat start vector, so I've just included the proto model as proto (HTK prevents us from giving this a file extension, but it shouldn't be a problem). Consult the HTK Book if you want to use a different model. This model takes the mean to be 0, and the variance 1.

From this model, HCompV scans the data, and computes the global mean and variance for the whole corpus, and outputs that. At this point, you'll need a directory hmm00/ (this isn't MATLAB, start counting at 0). HCompV also takes a training file list, which contains all the .mfc files. A script, compList, generates this file.

[kgorman@harris atis]$ ./compList > train.scp

This also requires another config file, entitled comp.cfg. It is similar to the previous .cfg file, and its HTK-specific details are not really worth talking about. With this ready, you can generate the first stab at a model. This takes a few seconds, but nothing unbelievable.

[kgorman@harris atis]$ HCompV -C comp.cfg -f 0.01 -S train.scp -M hmm00/ proto

The new model is now in hmm00/. You can check it out if you want.

Re-estimation

For some reason, HTK allows us to re-estimate the flat start monophones, but doesn't provide a useful tool for putting the flat start variances into the files we need. There are two such files. The first, called macros, contains the parameter kind and the size of the vector, followed by the variance floors. The former are known from the proto file, and the latter are stored in hmm00/vFloors, which HCompV generated. macros is created by concatenating the macros template mactmp.txt (included in my set of configuration files/scripts) and vFloors. To do so, issue the following.

[kgorman@harris atis]$ cat mactmp.txt hmm00/vFloors > hmm00/macros

The second part is a bit more complex. What you need to do is create HMM definitions which set each unique phone symbol to be defined by the proto. There is no simple way to do this, but I've attempted to script it, so you don't have to. The script is called init. Since we don't have enough data to deal with short pauses yet, we will also make use of phone0.list as an input to init (we don't want an sp model yet).

[kgorman@harris atis]$ ./init phone0.list > hmm00/hmmdefs
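To make the idea concrete, here is a rough Python sketch of the cloning step. It is not the init script; the function name clone_proto is hypothetical, and it assumes the prototype definition begins at a line starting with ~h and runs to the end of the file, with everything before that line (the ~o options) belonging in macros instead. The prototype text in the demonstration is a shortened stand-in, not a usable HMM.

```python
def clone_proto(proto_text, phones):
    """Return hmmdefs text holding one copy of the prototype per phone."""
    lines = proto_text.splitlines()
    # Find the ~h line that names the prototype model
    start = next(i for i, line in enumerate(lines) if line.startswith("~h"))
    body = lines[start + 1:]  # everything after the ~h line
    out = []
    for phone in phones:
        out.append('~h "%s"' % phone)  # rename the copy after the phone
        out.extend(body)
    return "\n".join(out) + "\n"


# Shortened stand-in for hmm00/proto (assumption: real files are much longer)
stub = '~o\n~h "proto"\n<BEGINHMM>\n<ENDHMM>\n'
print(clone_proto(stub, ["AH", "sil"]), end="")
```

In real use, the prototype text would be read from hmm00/proto, the phones from phone0.list, and the result written to hmm00/hmmdefs.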

The last point before re-estimation is that we need to generate new label files for HERest, since it expects phone-labeled data without sp. This fact is a bit poorly-documented in the HTK book, and a bit idiosyncratic. We'll use a script that parses phone0.mlf and puts the results into mfc/ as label files, which HTK expects.

[kgorman@harris atis]$ ./phone2label phone0.mlf
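Splitting an MLF back into per-utterance label files mostly amounts to walking through it entry by entry. The sketch below shows that parsing step in Python; it is an illustration rather than the phone2label script, the name split_mlf is made up, and the sample text is a shortened, hand-written entry.

```python
def split_mlf(mlf_text):
    """Map each quoted label-file name in an MLF to its list of labels."""
    entries = {}
    current = None
    for line in mlf_text.splitlines():
        line = line.strip()
        if not line or line == "#!MLF!#":
            continue                      # skip blanks and the header
        if line.startswith('"') and line.endswith('"'):
            current = line.strip('"')     # a new entry begins
            entries[current] = []
        elif line == ".":
            current = None                # a lone period ends the entry
        elif current is not None:
            entries[current].append(line)
    return entries


sample_mlf = '#!MLF!#\n"data/8k3011ss.lab"\nsil\nF\nAY\nN\nD\n.\n'
for name, labels in split_mlf(sample_mlf).items():
    print(name, labels)
```

Each entry's labels would then be written, one per line, to a file of the same base name in mfc/.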

Now we are ready to re-estimate. HERest takes a .cfg file (we can use the one from the flat start generation), the phone0.mlf file we generated earlier, a set of pruning thresholds (I went with the suggested ones in the HTK Book), a training file list (used previously), macros, hmmdefs, and the sp-less monophones file phone0.list. The full command in all her glory is below.

[kgorman@harris atis]$ HERest -C comp.cfg -I phone0.mlf -t 250.0 150.0 1000.0 -S train.scp -H hmm00/macros -H hmm00/hmmdefs -M hmm01/ phone0.list

Okay, we're going to do that again twice, but change the input/output directories. That'll improve our model, which will be in hmm03/ once we are done.

[kgorman@harris atis]$ HERest -C comp.cfg -I phone0.mlf -t 250.0 150.0 1000.0 -S train.scp -H hmm01/macros -H hmm01/hmmdefs -M hmm02/ phone0.list
[kgorman@harris atis]$ HERest -C comp.cfg -I phone0.mlf -t 250.0 150.0 1000.0 -S train.scp -H hmm02/macros -H hmm02/hmmdefs -M hmm03/ phone0.list
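Since each round differs only in its input and output directories, the commands can be generated rather than typed. The following Python sketch builds the argument lists for a run of rounds; the function name herest_rounds is hypothetical, and the flags are simply copied from the commands above.

```python
def herest_rounds(first, last, mlf, phone_list):
    """Return HERest argument lists taking hmm<first> up to hmm<last>."""
    commands = []
    for i in range(first, last):
        src = "hmm%02d" % i
        dst = "hmm%02d/" % (i + 1)
        commands.append([
            "HERest", "-C", "comp.cfg", "-I", mlf,
            "-t", "250.0", "150.0", "1000.0", "-S", "train.scp",
            "-H", src + "/macros", "-H", src + "/hmmdefs",
            "-M", dst, phone_list,
        ])
    return commands


for cmd in herest_rounds(0, 3, "phone0.mlf", "phone0.list"):
    print(" ".join(cmd))
```

Each list can be handed to subprocess.run(cmd, check=True) to actually run the round, provided the output directory already exists.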

Training and segmenting

Now that we've created and trained several versions of the model we'll be using, it's time to fix a few assumptions that have been made on the way. The first concerns the two varieties of "silence" in the corpus: sil, which we already have an HMM for and which goes at the beginning and end of sentences, and sp, which goes at the end of each word and still lacks an HMM. We'd expect these two to be similar, but not entirely the same, both as phones and as HMMs.

Fixing the silence models

Now, we've got a model for sil, but lack one for the small pause sp. To create this, we'll use a three-step process. We'll copy the middle state (state 3) from the HMM we've built for sil, use it to build a model for sp, and then run a script to tie the states together and fill out the other transitions. First off, sil2sp extracts the sil state we want.

[kgorman@harris atis]$ ./sil2sp hmm03/hmmdefs > hmm04/hmmdefs

We should also copy hmm03/macros to hmm04/, just for the sake of staying organized.

[kgorman@harris atis]$ cp hmm03/macros hmm04/macros

Now we run an HHEd script. HHEd is like HLEd, except that it is a script-based editor for HMMs instead of label files. We will create states 2 and 4 for sp, and tie the middle state of sil to state 2 of sp. The script we'll use is included.

[kgorman@harris atis]$ HHEd -H hmm04/macros -H hmm04/hmmdefs -M hmm05 sil.hed phone1.list

Now we have HMMs for our silence models, which means we need to modify the label files that are used as input to HERest. Specifically, they need sp labels. We accomplish this by generating a new MLF called phone1.mlf, and then using this as input to phone2label, which will store the new label files in mfc/.

[kgorman@harris atis]$ HLEd -l 'data/' -d dict.list -i phone1.mlf mkPhones1.led word0.mlf
[kgorman@harris atis]$ ./phone2label phone1.mlf

Training

Now, go get a sandwich while you go through two more rounds of training.

[kgorman@harris atis]$ HERest -C comp.cfg -I phone1.mlf -t 250.0 150.0 1000.0 -S train.scp -H hmm05/macros -H hmm05/hmmdefs -M hmm06/ phone1.list
[kgorman@harris atis]$ HERest -C comp.cfg -I phone1.mlf -t 250.0 150.0 1000.0 -S train.scp -H hmm06/macros -H hmm06/hmmdefs -M hmm07/ phone1.list

Re-aligning data

At this point, we go through a re-alignment stage. Since we're not ready to get the segmentations output by HTK, we want to suppress that. The point of this re-alignment is to check for alternate pronunciations of words in the dictionary. The cmudict contains multiple pronunciations of words, as may your generated dictionary; at this step, HTK tries to figure out which pronunciation is more applicable.

Before we begin, we need to add a "pronunciation" for our silence model to the dictionary.

[kgorman@harris atis]$ echo "sil sil" >> dict.list

HVite is the command for data alignment (and more generally, decoding of our HMMs). It implements the Viterbi algorithm, an ingenious method for finding the most likely sequence of hidden states given the observed data. It works by making a strong assumption about the distribution (that each state depends only on the one before it), which drastically reduces the search space. Without the Viterbi algorithm, much of modern machine learning would be nearly impossible. We call HVite on our HMM definitions and the current word0.mlf, and it outputs a new phone-level MLF, phone2.mlf.

[kgorman@harris atis]$ HVite -o SWT -b sil -C comp.cfg -a -H hmm07/macros -H hmm07/hmmdefs -i phone2.mlf -m -t 250.0 -y lab -I word0.mlf -S train.scp -L data/ dict.list phone1.list
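For the curious, the core of the Viterbi algorithm fits in a few lines. The Python sketch below runs it on a made-up two-state toy model; it only illustrates the dynamic programming idea and says nothing about how HVite is implemented. All state names, observations, and probabilities are invented for the example, and every probability is nonzero so that logarithms are safe.

```python
import math


def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most likely state path for obs and its log probability."""
    # best[t][s]: log probability of the best path ending in state s at time t
    best = [{s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]])
             for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        best.append({})
        back.append({})
        for s in states:
            # Only the best predecessor of s needs to be remembered
            prev = max(states,
                       key=lambda p: best[t - 1][p] + math.log(trans_p[p][s]))
            best[t][s] = (best[t - 1][prev] + math.log(trans_p[prev][s])
                          + math.log(emit_p[s][obs[t]]))
            back[t][s] = prev
    # Trace back from the best final state
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path, best[-1][last]


# Toy model: all numbers below are invented for illustration
states = ["sil", "speech"]
start_p = {"sil": 0.8, "speech": 0.2}
trans_p = {"sil": {"sil": 0.6, "speech": 0.4},
           "speech": {"sil": 0.3, "speech": 0.7}}
emit_p = {"sil": {"quiet": 0.9, "loud": 0.1},
          "speech": {"quiet": 0.2, "loud": 0.8}}
path, logp = viterbi(["quiet", "loud", "loud", "quiet"],
                     states, start_p, trans_p, emit_p)
print(path, math.exp(logp))
```

The work grows with the number of frames times the square of the number of states, rather than with the number of possible paths, which is what makes decoding long utterances feasible.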

More training

With the most likely pronunciation chosen for each item in the dictionary, we begin two more rounds of training, this time on phone2.mlf.

[kgorman@harris atis]$ HERest -C comp.cfg -I phone2.mlf -t 250.0 150.0 1000.0 -S train.scp -H hmm07/macros -H hmm07/hmmdefs -M hmm08/ phone1.list
[kgorman@harris atis]$ HERest -C comp.cfg -I phone2.mlf -t 250.0 150.0 1000.0 -S train.scp -H hmm08/macros -H hmm08/hmmdefs -M hmm09/ phone1.list

Higher level models

At this point, we have enough data to drastically refine our models. We accomplish this by making use of a more powerful representation of each of our phones, by using multi-Gaussian mixtures to model them. After we do this, we can further improve our models by building phone models which are speaker-dependent. HTK makes all this easy, if your data is appropriately prepared.

ATIS doesn't easily provide us with much information about the speaker, so at this point I'll be using examples strictly from the Brent corpus of infant-directed speech in CHILDES, which does.

Going multi-Gaussian

Going speaker dependent

Segmenting

At long last, we have (perhaps) a sufficient model to obtain time-aligned word and phone transcriptions. We'll use another instance of HVite to output the most likely alignments. The model works by adjusting alignments to maximize the degree to which phones cluster, so HTK will have computed the most likely location of every phone (within the linear order of a sentence), using the model we've built so far.

At this point, there is another possibility for refining the model before outputting the segmentations. One option is to build bi- or triphone models. The goal with these types of models is to effectively model co-articulation effects we know to occur pervasively in natural speech. Therefore, under a triphone model, the English word [bIt] 'bit' in isolation has the following triphones:

and so on. However, there is a major problem with this approach: data sparsity. Assuming there are approximately 40 phonemes in English (an assumption which is highly dependent on dialect), there are 40 monophones, but 40^2 = 1600 possible unique biphones, and 40^3 = 64000 possible triphones. Luckily, quite a few of these combinations never occur (Pierrehumbert 1994), but even with large corpora (and the corpus I'm working with, ATIS, certainly isn't large) this creates a major data sparsity problem. If you choose to do this, though, consult the HTK Book.

It's time to see how well your segmenter has worked. This is the first time you'll get a real sense of how well the process has gone, and if you're unsatisfied you can still run more estimations to see if they converge on something more satisfying. HVite will do the trick again, but this time we'll tell it to output the time alignments as well by not passing T to the -o flag.

[kgorman@harris atis]$ HVite -o SM -b sil -C comp.cfg -a -H hmm09/macros -H hmm09/hmmdefs -i word1.mlf -m -t 250.0 -y lab -I word0.mlf -S train.scp -L data/ dict.list phone1.list

The output is in word1.mlf.

If you did it right, you might see something like this:

[kgorman@harris atis]$ head -15 word1.mlf
#!MLF!#
"mfc/8k3011ss.lab"
0 8900000 sil sil
8900000 9500000 F FIND
9500000 10300000 AY
10300000 10800000 N
10800000 11100000 D
11100000 11100000 sp
11100000 11500000 M ME
11500000 12600000 IY
12600000 12600000 sp
12600000 13100000 AH A
13100000 13100000 sp
13100000 14400000 F FLIGHT
14400000 15200000 L

What's with those really long numbers? As mentioned above, HTK works in 100ns intervals (that's 10^-7 seconds). We'll write the word-level transcription into a text file that's more useful. These go in aligned/ by script default.

[kgorman@harris atis]$ ./wordLine word1.mlf

Then, if we look in aligned/, we'll see something like this:

[kgorman@harris atis]$ cat aligned/8k3011ss.lab
0.00 0.89 [silence]
0.89 1.11 FIND
1.11 1.26 ME
1.26 1.31 A
1.31 1.71 FLIGHT
1.71 1.84 [silence]
1.84 2.03 THAT
2.03 2.05 [silence]
2.05 2.43 FLIES
2.43 2.60 FROM
2.60 2.63 [silence]
2.63 3.07 MEMPHIS
3.07 3.17 TO
3.17 3.80 TACOMA
3.80 4.03 [silence]
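The conversion from the MLF to this format can be sketched in Python as follows. This is a reconstruction from the two listings above, not the wordLine script: it assumes that a word label appears only on a word's first phone, that sil and non-empty sp are written as [silence], and that zero-length sp is dropped. The function name word_intervals is hypothetical.

```python
def word_intervals(lines):
    """Turn 'start end phone [word]' lines into (start_s, end_s, label)."""
    intervals = []
    for line in lines:
        fields = line.split()
        # HTK times are in 100 ns units; divide by 1e7 to get seconds
        start, end = int(fields[0]) / 1e7, int(fields[1]) / 1e7
        phone = fields[2]
        if phone in ("sil", "sp"):
            if end > start:               # drop zero-length pauses
                intervals.append((start, end, "[silence]"))
        elif len(fields) > 3:
            intervals.append((start, end, fields[3]))  # first phone of a word
        else:
            # A later phone of the current word: extend that word's end time
            first, _, label = intervals[-1]
            intervals[-1] = (first, end, label)
    return intervals


sample = [
    "0 8900000 sil sil",
    "8900000 9500000 F FIND",
    "9500000 10300000 AY",
    "10300000 10800000 N",
    "10800000 11100000 D",
    "11100000 11100000 sp",
    "11100000 11500000 M ME",
    "11500000 12600000 IY",
    "12600000 12600000 sp",
    "12600000 13100000 AH A",
    "13100000 13100000 sp",
]
for start, end, label in word_intervals(sample):
    print("%.2f %.2f %s" % (start, end, label))
```

Run on the first lines of word1.mlf shown earlier, this reproduces the first four lines of the aligned file above.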

It didn't crash and burn! Your files should generally look like that, but it's difficult to really evaluate the quality of alignments without looking at them relative to the audio file. For this, Praat is still probably the best tool, despite many defects. Praat can take TextGrid files as input, which specify various tiers of labels matching up with audio. The format is a bit idiosyncratic, so a script is included for this purpose.

[kgorman@harris atis]$ ./textGrid word1.mlf

This puts the TextGrids into textGrids/ (which you may need to create if you get an error).
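If you want to see what such a file contains, here is a minimal Python sketch that writes a single word tier in Praat's long text format. It is not the textGrid script (which may well produce more tiers); the function name text_grid and the tier name "words" are assumptions, and the intervals are expected to be contiguous, as Praat requires for an interval tier.

```python
def text_grid(intervals, tier="words"):
    """Format (start, end, label) tuples as a one-tier Praat TextGrid."""
    xmin, xmax = intervals[0][0], intervals[-1][1]
    out = [
        'File type = "ooTextFile"',
        'Object class = "TextGrid"',
        "",
        "xmin = %.3f" % xmin,
        "xmax = %.3f" % xmax,
        "tiers? <exists>",
        "size = 1",
        "item []:",
        "    item [1]:",
        '        class = "IntervalTier"',
        '        name = "%s"' % tier,
        "        xmin = %.3f" % xmin,
        "        xmax = %.3f" % xmax,
        "        intervals: size = %d" % len(intervals),
    ]
    for i, (start, end, label) in enumerate(intervals, 1):
        out += [
            "        intervals [%d]:" % i,
            "            xmin = %.3f" % start,
            "            xmax = %.3f" % end,
            # A double quote inside a label is escaped by doubling it
            '            text = "%s"' % label.replace('"', '""'),
        ]
    return "\n".join(out) + "\n"


grid = text_grid([(0.0, 0.89, "[silence]"), (0.89, 1.11, "FIND")])
print(grid, end="")
```

Saving the returned text with a .TextGrid extension gives a file that can be opened in Praat alongside the matching audio.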

This isn't as textually appealing as the generated label files, but here's a screenshot of it aligned with the audio in Praat.

Phoneticians will note it's pretty good; the high-band perturbation is marked as fricative [f], the lateral [l] is marked around where voicing begins, the diphthong has narrow and strong formants rising up to a target, and [t] begins near a clear alveolar gesture followed by a closure. There's, at worst, a bit of bleed in terms of how the phones are segmented. Even where it's less than perfect, it picks out the centers of the phones rather well.

Well, that's about it. Happy segmenting! If you're considering using your segmentations to extract acoustic features, may I suggest Praat-Py? It's a welcome relief from Praat's dreadful scripting engine.

Acknowledgements

Thanks to Jiahong Yuan, who provided practical advice on HTK and influenced this work. I'm indebted to Mark Hasegawa-Johnson, Mark Liberman, and Richard Sproat for their instruction. Thanks also to Catherine Lai, Carolyn Quam, Stephen Isard, Keelan Evanini, and the Linguistic Data Consortium.

Contents © 2007 Kyle Gorman. If you would like to help me improve this tutorial, please email me.

References

  1. D.A. Dahl, et al. ATIS3 Training Data. 1994. Linguistic Data Consortium: http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC94S19
  2. D.A. Dahl, et al. ATIS3 Test Data. 1995. Linguistic Data Consortium: http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC95S26
  3. J. Garofolo, et al. ATIS2. 1993. Linguistic Data Consortium: http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC93S5
  4. D. Jurafsky and J.H. Martin. Speech and Language Processing. 2000. Upper Saddle River, New Jersey: Prentice-Hall, Inc.
  5. J. Pierrehumbert. "Syllable Structure and Word Structure," in P. Keating ed., Papers in Laboratory Phonology III. 1994. Cambridge: Cambridge Univ. Press.
  6. S. Young, et al. The HTK Book. 2005. Cambridge University Engineering Department: http://htk.eng.cam.ac.uk/docs/docs.shtml
