Despite years of skepticism, the increasing power of desktop computers
and greater availability of speech corpora have made
corpus-based phonetic analysis an important method of linguistic inquiry.
To perform this sort of analysis, it is often necessary to have
word- or phone-level transcriptions time-aligned with audio data. For instance,
if you were interested in the f2 properties of
a diphthong in a dialect of English, and wished
to extract them automatically from a corpus of data, you would need
a list of the places where the diphthong occurred, with the start
and end time of each.
START STOP PHONE
10.203 10.560 EY
18.902 19.234 EY
43.023 43.430 EY
54.323 54.840 EY
... ... ...
Above is a sample text file that could be obtained from a phone-aligned speech transcript using a tool like grep. The onset and offset of each token of a certain phone (labeled in ARPABET) are listed; this would be useful input to a script for Praat or some other software for phonetic analysis. However, not all corpora have this sort of segmentation. Even corpora purchased from other sources are often not segmented, and recordings you make yourself are rarely phone-segmented. Segmentation by hand is exceedingly tedious, and inter-annotator agreement is poor at best. Since the declassification of various algorithms and the application of Shannon's information theory to human language, however, it has become possible to induce these segmentations and alignments. Modern speech corpora are in fact not only segmented but also transcribed in this manner.
This method, along with methods for the related problem of building speech
recognition systems, is detailed carefully in the HTK Book, a lengthy but invaluable resource
for speech recognition.
However, this tutorial is dedicated to another use of speech recognition
algorithms: the time alignment and
segmentation of audio data for which word-level transcriptions exist.
There are two situations in which you might find yourself in possession
of recordings and transcripts, but not time alignments of the data. The first
is with professionally produced speech corpora of the sort available from
the Linguistic Data Consortium
(expensive, so usually obtained through an academic institution's license) or
from a source like CHILDES
(a free online source of child and child-directed speech). Another source of
data could be transcribed recordings of interviews,
perhaps from sociolinguistic research. These are often tediously hand-coded,
but automatic segmentation can be performed with enough data.
For this example, the data comes from professional sources.
I'm trying to study the acquisition of intonation
in children. It is my hope to automatically extract f0
(vocal pitch)
information from time-aligned transcripts. For this, I will use a portion of the
CHILDES corpus that consists of infant-directed speech. It has
utterance-level transcripts
and high-quality audio, but lacks time alignments or phone
segmentation.
It's known that child-directed speech is acoustically different
from other domains of speech, not to mention its limited, often novel
vocabulary and the simplicity of its structure; therefore, I'm building a
comparative database of adult-directed speech. To match the task,
I chose recordings from the ATIS corpus. ATIS data was recorded
in a so-called "Wizard of Oz" simulation: participants thought they were
talking to a computer, though the computer was "played" by a confederate.
This was done to obtain the speech recordings needed to
build an automatic speech recognizer that could perform this very task.
Speakers were told the "computer" was a system for automatically booking
flights around America, and were asked to book flights. This is a similar task
to the child-directed speech act: the domain is narrow, the lexical items
are somewhat novel (toponyms), and the hearer is perceived as not being
fully sentient.
An older version of this tutorial used data
from ATIS2 and ATIS3 for which there are both audio and sentence transcripts.
I'll also make use of some data from CHILDES, for which
higher-level models will be used.
The software core for speech segmentation comes from HTK, the Hidden Markov Model Toolkit, originally
developed at the Cambridge University Engineering Department, later acquired
by Microsoft (through its purchase of Entropic), and now again maintained
at Cambridge. The package contains a series
of sophisticated algorithms for building and decoding hidden Markov models.
To obtain HTK, you must register with a username and password, then download
the source code. Like anything in this field, it's designed to work
in a UNIX-like environment (BSD, Solaris, Linux, Mac OS X), but
perhaps because of the Microsoft connection, it's also compatible
with MS Windows. Once you have registered, compile HTK as
instructed by the package's documentation.
However, actually operating HTK requires a variety of text manipulation.
I perform this sort of operation with Perl
and a variety of shell tools like grep and sed, but
it can also be done with a more expansive tool like
Python or a combination of
shell scripting and legacy tools like awk. It's up to you,
but I've created a package of extensible Perl tools for use in this
process. It's available here [tar.gz], released
under a BSD-style
license. For users on systems which don't include Perl by default (that is,
Windows), you will need to obtain an environment
like ActivePerl.
I don't, however, vouch for my scripts' abilities to work on Windows.
You need the following to perform speech recognition:
[kgorman@harris ~]$ uname -rspo
RAM is more important, though. A desktop with 1 gig or more
is probably the minimum; with less, the wait gets long.
It's nice not to have to get a tasty sandwich every time
you make a minor change. As you can see, I'm using a lot.
[kgorman@harris ~]$ cat /proc/meminfo | grep MemTotal
MemTotal: 3778012 kB
This is the most important part of all, honestly. There are a few important
axioms to keep in mind.
The directory structure I suggest is to keep scripts, lists, and
configuration
files in one directory (on UNIX-alikes, probably a
/scratch or /scr directory).
Then, inside that, have a folder for raw data (audio and transcripts,
properly segmented), which I labeled data/. You also need another folder for your cepstral
coefficients, mfc/, and ones for your HMM coefficients (hmm00/,
hmm01/, and so on). Lastly, a directory for your segmentations,
seg/, is useful.
The HTK documentation suggests a few common file extensions (mlf,
scp, etc.), but for other files it leaves them bare. I find this
extremely confusing, so where possible I try to use UNIX-style file extensions.
Another useful tip is to keep a batch file which runs all the commands.
I provide one called compile, which comes with the commands
commented out. To run a command, simply remove its comment marker.
One last comment, on permissions. When you are working on a UNIX-like
system, some scripts may not work and will exit with a Permission
denied error. This usually indicates that you, the user, aren't authorized
to run the script as it is. If the script is called script,
issue the following chmod command to fix this (+x means "add
permission to execute the file"):
[kgorman@harris atis]$ chmod +x script
Before we actually begin, create a directory for the task. Inside that,
create a directory data/,
and put sentence-length raw audio files (.wav files)
and sentence-level transcripts (.lab files) in it.
The format of your .lab files should be one sentence to a line, one
line to a file, with words separated by spaces.
[kgorman@harris atis]$ cat data/8k3011ss.lab
8k3011ss.lab contains the sentence, and 8k3011ss.wav
contains the matching audio.
If your data is scattered in a complex directory structure, use UNIX
find with the pattern-matching -name (or case-insensitive
-iname) flag to get all the file names.
Obtain the scripts and configuration files. Untar/unzip
them.
[kgorman@harris atis]$ wget http://ling.upenn.edu/~kgorman/papers/seg.tar.gz
This is the sort of tarball which explodes in your working
directory, so execute this in the directory where you want all your config
files, not the one for your data files or your home directory.
Obtain the CMU pronunciation dictionary. As mentioned
above, you may not wish to do this since an optimized
version is included in the seg package.
[kgorman@harris atis]$ wget ftp://ftp.cs.cmu.edu/afs/cs.cmu.edu/data/anonftp/project/fgdata/dict/cmudict.0.6d
Create a wordlist. If your sentence-level files are in data/
and labeled with .lab extensions, issue the following command:
[kgorman@harris atis]$ cat data/*.lab | ./wordList | sort | uniq > word.list
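If you would rather not depend on the Perl helper, the pipeline above can be approximated in Python. This is only a minimal sketch: it assumes that wordList does nothing more than split each transcript line on whitespace (the script itself isn't shown here), and the function names build_word_list and read_lab_lines are hypothetical.

```python
import glob


def build_word_list(lines):
    """Return the sorted list of unique whitespace-separated tokens."""
    words = set()
    for line in lines:
        words.update(line.split())
    return sorted(words)


def read_lab_lines(pattern="data/*.lab"):
    """Yield every line of every transcript file matching the pattern."""
    for path in sorted(glob.glob(pattern)):
        with open(path, encoding="utf-8") as handle:
            yield from handle


if __name__ == "__main__":
    # Roughly: cat data/*.lab | ./wordList | sort | uniq > word.list
    with open("word.list", "w", encoding="utf-8") as out:
        for word in build_word_list(read_lab_lines()):
            out.write(word + "\n")
```

Note that sorted() orders strings by code point, which may not match what sort produces under your locale.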
Create a dictionary for the task. Fortunately, HTK provides a tool
for this, HDMan. It also outputs all the phones used,
in the same command.
[kgorman@harris atis]$ HDMan -m -w word.list -n phone1.list -l dlog dict.list cmudict.0.6
Two notes at this point. First off, the HTK documentation
uses file names like names.txt or whatever. I think it makes more
sense to call that file name.txt, since while it's a list of
names, it's a namefile. Secondly, when I executed
the command above, HDMan output a few error
messages indicating that the dictionary was out of order. Using
sort on the dictionary didn't resolve the problem, so it must
be something more idiosyncratic that HTK wants. If this is a problem,
just use the included cmudict.
This searches the wordlist for pronunciations, and puts all the necessary
pronunciations in dict.list and a list of the phones used
in phone1.list. HTK also inserts a phone symbol sp,
short for "short pause", at the end of each word. This is standard practice.
This next step is a bit idiosyncratic. We want to create a list of phones which
lacks sp, since it's difficult to bootstrap this model.
To do so, we'll use another model (once it's constructed), the model
for the phone sil, which we'll insert at the beginning
and end of every file. So, phone1.list has both
sil and sp, and phone0.list lacks sp.
You can create this using a simple grep command. -v
puts grep in a bizarro world where true is false, so
any line that doesn't match 'sp' will be printed.
[kgorman@harris atis]$ grep -v 'sp' phone1.list > phone0.list
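One caveat: grep -v 'sp' drops every line that contains the letters "sp" anywhere, not just the line that is exactly sp. That is harmless as long as no other phone symbol contains that substring, but an exact-match filter is safer. Here is a small Python sketch of the same step; the function name drop_phone and the symbol "spn" in the usage note are made up for illustration.

```python
def drop_phone(phones, unwanted="sp"):
    """Return the phone symbols, minus any that are exactly `unwanted`."""
    return [p for p in phones if p.strip() != unwanted]


if __name__ == "__main__":
    # Read phone1.list and write phone0.list without the sp entry.
    with open("phone1.list", encoding="utf-8") as source:
        kept = drop_phone(source.read().split())
    with open("phone0.list", "w", encoding="utf-8") as out:
        out.write("\n".join(kept) + "\n")
```

With this version, a hypothetical symbol such as "spn" would survive, where the substring match would have removed it.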
You also need to add sil to both files. This is simple.
[kgorman@harris atis]$ echo sil >> phone1.list
Now, you need to put your orthographic transcriptions into the HTK
label format, called MLF (master label file). The HTK tutorial provides a script
to do this (prompts2mlf, which will be in the HTKTutorial/
directory from installing HTK), but it assumes that your label files
have the file name as their first field. If this is not the case, as
with the ATIS data,
you can use my script, label2mlf. All the label files go into
a single MLF file.
[kgorman@harris atis]$ ./label2mlf data/*.lab > word0.mlf
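For the curious, the conversion is simple enough to sketch in Python. This is not the label2mlf script itself, just my guess at an equivalent, assuming the word-level MLF layout used in this tutorial: a #!MLF!# header, then for each file its quoted name, one word per line, and a closing period. The function name labels_to_mlf is hypothetical.

```python
import sys


def labels_to_mlf(transcripts):
    """Build word-level MLF text from (label_path, sentence) pairs."""
    lines = ["#!MLF!#"]
    for path, sentence in transcripts:
        lines.append('"%s"' % path)     # quoted name of the label file
        lines.extend(sentence.split())  # one word per line
        lines.append(".")               # a lone period closes each entry
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Usage (assumed): python labels_to_mlf.py data/*.lab > word0.mlf
    pairs = []
    for path in sys.argv[1:]:
        with open(path, encoding="utf-8") as handle:
            pairs.append((path, handle.read()))
    sys.stdout.write(labels_to_mlf(pairs))
```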
If everything is going well, word0.mlf should look like the following
(if you're using a corpus about booking flights):
#!MLF!#
...and so on. The same thing needs to be done for the phones in your
transcripts. Luckily, we don't have to write a lengthy script to do this,
since the label editor provided by HTK (HLEd) does this for us
automatically. It takes an HTK editor script (these have the extension
.led), here mkPhones0.led, which is included in the
configuration and script tarball above. This adds the
long silence phone sil
but deletes the short silence phone sp (we'll add it back in later:
that's why it's mkPhones0.led and not mkPhones1.led).
[kgorman@harris atis]$ HLEd -l 'data/' -d dict.list -i phone0.mlf mkPhones0.led word0.mlf
This assumes your data is in a subdirectory, data/.
SOURCEKIND = WAVEFORM
The one you may have to change is SOURCERATE. This is the sampling
period in 100ns units; the ATIS data was digitized at a
16 kHz sampling rate, so 625.0 is the right SOURCERATE.
Some of these settings are the default, others are suggested in the HTK Book.
If you're interested in what they all mean, you're advised to take a
signal processing class.
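If your audio was digitized at some other rate, the SOURCERATE value is easy to work out: it is the duration of one sample, counted in 100ns units. A minimal Python sketch of the arithmetic follows (the function name htk_sample_period is mine, not HTK's):

```python
def htk_sample_period(sample_rate_hz):
    """Return the duration of one sample in units of 100 ns.

    One second contains 10,000,000 units of 100 ns, so the period of a
    single sample is that number divided by the samples per second.
    """
    return 1e7 / sample_rate_hz


if __name__ == "__main__":
    for rate in (8000, 16000):
        print(rate, "Hz ->", htk_sample_period(rate))
```

For 16000 samples per second this gives 625.0, the value used in copy.cfg; for 8 kHz audio it would be 1250.0.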
If you're running HCopy
on more than a few files, you'll want to create
a source file. All UNIX shells have a rather limited number of arguments
they can take, and even if they didn't, would you want to type all
that out and wait for HCopy to load over and over again while
it iterated through the data?
You should create a file that contains the list of audio (.wav)
files and the corresponding output (.mfc) files. This can
be created by the included copyList script. The format is
the filename of the .wav file, a space, then the filename
of the output .mfc file, followed by a newline.
[kgorman@harris atis]$ ./copyList > copy.scp
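If you'd like to build this list without Perl, here is a rough Python equivalent. It is a sketch that assumes every .wav file sits directly in data/ and that each .mfc file should sit next to its source; the function name copy_list_lines is hypothetical, and copyList itself may do things differently.

```python
import glob
import os


def copy_list_lines(wav_paths):
    """Return one 'source.wav target.mfc' line per input path, sorted."""
    lines = []
    for wav in sorted(wav_paths):
        stem, _extension = os.path.splitext(wav)
        lines.append("%s %s.mfc" % (wav, stem))
    return lines


if __name__ == "__main__":
    # Usage (assumed): python copy_list.py > copy.scp
    for line in copy_list_lines(glob.glob("data/*.wav")):
        print(line)
```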
copy.scp should look something like this (if your
data isn't in data/, you will need to edit copyList):
data/8k3011ss.wav data/8k3011ss.mfc
Now, you can execute HCopy. If you don't see any error messages
after the first few seconds, it is time for your first sandwich
of the day.
[kgorman@harris atis]$ HCopy -T 1 -C copy.cfg -S copy.scp
From this model, HCompV scans the data, and computes the
global mean and variance for the whole corpus, and outputs that.
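To make "global mean and variance" concrete, here is a toy Python sketch of the computation over a list of feature frames. It is an illustration, not HTK's code: it divides by the number of frames and makes no attempt to match HTK's exact estimator. The second function mirrors the idea behind the -f 0.01 option in the HCompV command used in this tutorial, which sets the variance floors to that fraction of the global variance.

```python
def global_mean_variance(frames):
    """Return per-dimension (means, variances) over all frames."""
    count = len(frames)
    dims = len(frames[0])
    means = [sum(f[d] for f in frames) / count for d in range(dims)]
    variances = [
        sum((f[d] - means[d]) ** 2 for f in frames) / count
        for d in range(dims)
    ]
    return means, variances


def variance_floors(variances, factor=0.01):
    """Scale the global variances by `factor` to get variance floors."""
    return [factor * v for v in variances]


if __name__ == "__main__":
    # Two made-up two-dimensional frames, purely for illustration.
    means, variances = global_mean_variance([[1.0, 2.0], [3.0, 6.0]])
    print(means, variances, variance_floors(variances))
```

In a flat start, every state of every phone model begins with these same global values.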
At this point, you'll need a directory hmm00/ (this isn't
MATLAB, start counting at 0). HCompV also takes a training
file list, which contains all the .mfc files. A script,
compList, generates this file.
[kgorman@harris atis]$ ./compList > train.scp
This also requires
another config file, entitled comp.cfg. It is similar to
the previous .cfg file,
with an HTK-specific degree of complexity not really worth
talking about. With this ready, you can generate the first stab
at a model. This takes a few seconds, but nothing unbelievable.
[kgorman@harris atis]$ HCompV -C comp.cfg -f 0.01 -S train.scp -M hmm00/ proto
The new model is now in hmm00/. You can check it out if you want.
For some reason, HTK allows us to re-estimate the flat start monophones,
but doesn't provide us a useful tool to put the flat start
variances into the files we need. The first, called macros,
contains the parameter kind and vector size, followed by the variance floors.
The former are known from the proto file,
and the latter are stored in hmm00/vFloors.
macros is created by concatenating the macros template (included
in my set of configuration files/scripts), mactmp.txt,
and vFloors. To do so, issue the following.
[kgorman@harris atis]$ cat mactmp.txt hmm00/vFloors > hmm00/macros
The second part is a bit more complex. What you need to do is create
HMM definitions in which each unique phone symbol is defined
by a copy of the proto. There is no simple way to do this, but I've
attempted to script it, so you don't have to. The script is
called init. Since we don't have enough data to deal
with short pauses yet, we will use phone0.list
as the input to init (we don't want an sp model yet).
[kgorman@harris atis]$ ./init phone0.list > hmm00/hmmdefs
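In case the idea is unclear, here is a Python sketch of what such a script has to do. It rests on assumptions I should be loud about: that the re-estimated prototype (for example hmm00/proto) contains a line of the form ~h "proto" followed by the model body, and that cloning means repeating that body once per phone under a new name. The function name clone_proto is hypothetical, and the real init script may differ.

```python
def clone_proto(proto_text, phones, proto_name="proto"):
    """Return HMM definitions with one renamed copy of the proto per phone."""
    header = '~h "%s"' % proto_name
    start = proto_text.find(header)
    if start < 0:
        raise ValueError("no %s definition found" % header)
    # Everything after the header line's name is the shared model body.
    body = proto_text[start + len(header):]
    if not body.endswith("\n"):
        body += "\n"
    return "".join('~h "%s"%s' % (phone, body) for phone in phones)


if __name__ == "__main__":
    # Assumed file locations; adjust to your own layout.
    with open("hmm00/proto", encoding="utf-8") as handle:
        proto = handle.read()
    with open("phone0.list", encoding="utf-8") as handle:
        phone_symbols = handle.read().split()
    print(clone_proto(proto, phone_symbols), end="")
```

Anything before the ~h line (the global options) is dropped here, on the assumption that it lives in macros instead.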
The last point before re-estimation is that we need to generate
new label files for HERest, since it expects phone-labeled data
without sp. This fact is a bit poorly-documented in
the HTK book, and a bit idiosyncratic. We'll use a script that parses
phone0.mlf and puts the results into mfc/ as label
files, which HTK expects.
[kgorman@harris atis]$ ./phone2label phone0.mlf
Now we are ready to re-estimate. HERest takes a .cfg
file (we can use the one from the flat start generation),
the phone0.mlf file we generated earlier,
a set of pruning thresholds (I went with the suggested ones
in the HTK Book), a training file list (used previously),
macros, hmmdefs, and the sp-less
monophones file phone0.list.
The full command in all her glory is below.
[kgorman@harris atis]$ HERest -C comp.cfg -I phone0.mlf -t 250.0 150.0 1000.0 -S train.scp -H hmm00/macros -H hmm00/hmmdefs -M hmm01/ phone0.list
Okay, we're going to do that again twice,
but change the input/output directories.
That'll improve our model, which will be in hmm03/ once we
are done.
[kgorman@harris atis]$ HERest -C comp.cfg -I phone0.mlf -t 250.0 150.0 1000.0 -S train.scp -H hmm01/macros -H hmm01/hmmdefs -M hmm02/ phone0.list
Now that we've created and trained several versions of the model
we'll be using, it's time to fix a few assumptions that have been
made along the way. The first concerns the two varieties of "silence"
in the corpus: sil, which we already have an HMM for and which
goes at the beginning and end of sentences, and sp,
which lacks an HMM. We'd expect these two to be similar, but not
entirely the same, both as HMMs and as phones.
Now, we've got a model for sil, but lack one for the
short pause sp. To create this, we'll use a three-step process.
We'll copy the middle state (state 3) from the HMM (that is what
our models are) we've built for sil and transfer it to
build a model for sp. Then we'll run a script to
tie the states together and fill out the other transitions. First off,
sil2sp extracts the sil state we want.
[kgorman@harris atis]$ ./sil2sp hmm03/hmmdefs > hmm04/hmmdefs
We should also copy hmm03/macros to hmm04/, just for
the sake of staying organized.
[kgorman@harris atis]$ cp hmm03/macros hmm04/macros
Now we run an HHEd script. HHEd is like HLEd,
except that it is a script-based editor for HMMs instead of label files.
We will create states 2 and 4 for sp, and tie the middle state
of sil to state 2 of sp. The script we'll use
is included.
[kgorman@harris atis]$ HHEd -H hmm04/macros -H hmm04/hmmdefs -M hmm05 sil.hed phone1.list
Now we have HMMs for our silence models, which means we need to
modify the label files that are used as input to HERest.
Specifically, they need sp labels. We accomplish this by generating
a new MLF called phone1.mlf, and then using this as input
to phone2label, which will store the new label files
in mfc/.
[kgorman@harris atis]$ HLEd -l 'data/' -d dict.list -i phone1.mlf mkPhones1.led word0.mlf
Now, go get a sandwich while you go through two more rounds of
training.
[kgorman@harris atis]$ HERest -C comp.cfg -I phone1.mlf -t 250.0 150.0 1000.0 -S train.scp -H hmm05/macros -H hmm05/hmmdefs -M hmm06/ phone1.list
At this point, we go through a re-alignment stage. Since we're not ready
to get the segmentations output by HTK, we want to suppress them. The point
of this re-alignment is to check for alternate pronunciations
of words in the dictionary. The cmudict contains
multiple pronunciations of some words, as may your generated
dictionary; at this step, HTK tries to figure out which
pronunciation is most applicable in each case.
Before we begin, we need to add a "pronunciation" for
our silence model to the dictionary.
[kgorman@harris atis]$ echo "sil sil" >> dict.list
HVite is the command for data alignment (and more generally,
decoding of our HMMs). It implements the Viterbi algorithm, an ingenious method
for finding the most likely state sequence under a probability
distribution. It works by making a strong (Markov) assumption about the
distribution, which drastically reduces the search space.
Without the Viterbi algorithm, much of modern machine learning
would be nearly impossible. We call HVite on our
HMM definitions and the current word0.mlf,
and it outputs a new phone-level MLF, phone2.mlf.
[kgorman@harris atis]$ HVite -o SWT -b sil -C comp.cfg -a -H hmm07/macros -H hmm07/hmmdefs -i phone2.mlf -m -t 250.0 -y lab -I word0.mlf -S train.scp -L data/ dict.list phone1.list
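If you've never seen the Viterbi algorithm written out, here is a generic textbook version in Python, working in log probabilities over a small discrete HMM. To be clear, this is a sketch of the idea and not how HVite is implemented; HVite works on continuous observations and a network built from the dictionary. The names viterbi and log_table are my own.

```python
import math


def log_table(table):
    """Convert a nested dict of probabilities into log probabilities."""
    return {k: {j: math.log(p) for j, p in row.items()}
            for k, row in table.items()}


def viterbi(observations, states, log_start, log_trans, log_emit):
    """Return (best_log_prob, best_path) for an observation sequence."""
    # best[t][s]: log probability of the best path ending in s at time t
    best = [{s: log_start[s] + log_emit[s][observations[0]] for s in states}]
    back = [{}]
    for obs in observations[1:]:
        scores, pointers = {}, {}
        for s in states:
            # Markov assumption: only the previous state matters, so we
            # keep a single best predecessor for each state.
            prev = max(states, key=lambda p: best[-1][p] + log_trans[p][s])
            scores[s] = best[-1][prev] + log_trans[prev][s] + log_emit[s][obs]
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)
    # Trace the back-pointers from the best final state.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for pointers in reversed(back[1:]):
        path.append(pointers[path[-1]])
    path.reverse()
    return best[-1][last], path
```

Because only one predecessor is stored per state and time step, the work grows with the sequence length times the square of the number of states, rather than with the number of possible paths.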
With the most likely pronunciation chosen for each word,
we begin two more rounds of training, this
time on phone2.mlf.
[kgorman@harris atis]$ HERest -C comp.cfg -I phone2.mlf -t 250.0 150.0 1000.0 -S train.scp -H hmm07/macros -H hmm07/hmmdefs -M hmm08/ phone1.list
At this point, we have enough data to drastically refine our models. We
accomplish this by making use of a more powerful representation of each of
our phones, by using multi-Gaussian mixtures to model them. After we do this,
we can further improve our models by building phone models which are
speaker-dependent. HTK makes all this easy, if your data is appropriately
prepared.
ATIS doesn't easily provide us with much information about the speaker,
so at this point I'll be using examples strictly from the Brent corpus of
infant-directed speech in CHILDES, which does.
At long last, we have (perhaps) a sufficient model to obtain
time-aligned word and phone transcriptions. We'll use another instance
of HVite to output the most likely alignments. The model
works by adjusting alignments to maximize the degree to which
phones cluster, so HTK will have computed the most likely
location of every phone (within the linear order of a sentence),
using the model we've built so far.
At this point, there is another possibility for refining the model before outputting the segmentations.
One option is to build bi- or triphone models. The goal
with these types of models is to effectively model co-articulation effects
we know to occur pervasively in natural speech. Therefore, under a triphone
model, the English word [bIt] 'bit' in isolation has the following
triphones:
It's time to see how well
your segmenter has worked. This is the first time you'll get a real sense
of how well the process has gone, and if you're unsatisfied
you can still run more estimations to see if they converge on
something more satisfying. HVite will do the trick again,
but this time we'll tell it to output the time alignments as well
by not passing T to the -o flag.
[kgorman@harris atis]$ HVite -o SM -b sil -C comp.cfg -a -H hmm09/macros -H hmm09/hmmdefs -i word1.mlf -m -t 250.0 -y lab -I word0.mlf -S train.scp -L data/ dict.list phone1.list
The output is in word1.mlf.
If you did it right, you might see something like this:
[kgorman@harris atis]$ head -15 word1.mlf
What's with those really long numbers? As mentioned above,
HTK works in 100ns units (that's 10^-7 seconds). We'll write
the word-level transcription into a text file that's more useful. These
files go in aligned/ by default.
[kgorman@harris atis]$ ./wordLine word1.mlf
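The conversion itself is mostly a matter of dividing by ten million and grouping phones into words. Here is a Python sketch of that logic, under assumptions inferred from the sample output in this tutorial rather than from the wordLine source: a fourth field marks the first phone of a word, sil becomes [silence], and an sp is kept as [silence] only when it has some duration. The function name mlf_to_word_lines is hypothetical.

```python
def mlf_to_word_lines(lines):
    """Collapse phone-level MLF lines into 'start end WORD' strings."""
    words = []
    for line in lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # header, file name, or closing period
        # HTK times are in 100 ns units; 1e7 of them make one second.
        start, end = int(fields[0]) / 1e7, int(fields[1]) / 1e7
        model = fields[2]
        if len(fields) >= 4:
            # A fourth field marks the first phone of a new word.
            label = "[silence]" if fields[3] == "sil" else fields[3]
            words.append([start, end, label])
        elif model == "sp":
            if end > start:  # zero-length pauses are dropped
                words.append([start, end, "[silence]"])
        elif words:
            words[-1][1] = end  # extend the current word over this phone
    return ["%.2f %.2f %s" % tuple(w) for w in words]
```

Run over the lines of a single entry in word1.mlf, this yields lines such as "0.89 1.11 FIND".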
Then, if we look in aligned/, we'll see something like this:
[kgorman@harris atis]$ cat aligned/8k3011ss.lab
It didn't crash and burn! Your files should generally look like that,
but it's difficult to really evaluate the quality of alignments
without looking at them relative to the audio file. For this, Praat is still probably the best tool, despite
many defects. Praat can take TextGrid files as input, which specify various
tiers of labels matching up with audio. The format is a bit idiosyncratic,
so a script is included for this purpose.
[kgorman@harris atis]$ ./textGrid word1.mlf
This puts the TextGrids into textGrids/ (which you may need
to create if you get an error).
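For reference, here is roughly what writing a TextGrid involves, as a Python sketch that renders one interval tier in Praat's long text format. It assumes the intervals are contiguous and already in seconds, and it is my own illustration (the name intervals_to_textgrid is hypothetical), not the textGrid script from the tarball.

```python
def intervals_to_textgrid(intervals, tier_name="words"):
    """Render (start_s, end_s, label) tuples as a one-tier TextGrid."""
    xmin, xmax = intervals[0][0], intervals[-1][1]
    out = [
        'File type = "ooTextFile"',
        'Object class = "TextGrid"',
        "",
        "xmin = %s" % xmin,
        "xmax = %s" % xmax,
        "tiers? <exists>",
        "size = 1",
        "item []:",
        "    item [1]:",
        '        class = "IntervalTier"',
        '        name = "%s"' % tier_name,
        "        xmin = %s" % xmin,
        "        xmax = %s" % xmax,
        "        intervals: size = %d" % len(intervals),
    ]
    for index, (start, end, label) in enumerate(intervals, 1):
        text = label.replace('"', '""')  # embedded quotes are doubled
        out.extend([
            "        intervals [%d]:" % index,
            "            xmin = %s" % start,
            "            xmax = %s" % end,
            '            text = "%s"' % text,
        ])
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    print(intervals_to_textgrid([(0.0, 0.89, "[silence]"),
                                 (0.89, 1.11, "FIND")]), end="")
```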
This isn't as textually appealing as the generated label files,
but here's a screenshot of it aligned with the audio in Praat.
Phoneticians
will note it's pretty good; the high-band perturbation is marked as the fricative
[f], the lateral [l] marked around where voicing begins, the diphthong
has narrow and strong formants rising up to a target,
and [t] begins near a clear alveolar gesture
followed by a closure. There's, at worst, a bit of bleed in
terms of how the phones are segmented. Even where it's less than
perfect, it picks out the centers of the phones rather well.
Well, that's about it. Happy segmenting! If you're considering using
your segmentations to extract acoustic features, may I suggest
Praat-Py? It's a welcome
relief from Praat's dreadful scripting engine.
Thanks to Jiahong Yuan, who provided practical advice on HTK and
influenced this work. I'm indebted to Mark Hasegawa-Johnson,
Mark Liberman, and Richard Sproat for their instruction.
Thanks also to Catherine Lai, Carolyn Quam, Stephen Isard, Keelan
Evanini, and the Linguistic Data
Consortium.
Contents © 2007 Kyle Gorman. If you would like to help me improve
this tutorial, please email me.
Corpora
Software
Prerequisites
Linux 2.6.9-42.0.3.ELsmp x86_64 GNU/Linux
Staying organized
Bootstrapping the model
FIND ME A FLIGHT THAT FLIES FROM MEMPHIS TO TACOMA
[kgorman@harris atis]$ tar -xf seg.tar.gz
[kgorman@harris atis]$ echo sil >> phone0.list
"data/8k3011ss.lab"
FIND
ME
A
FLIGHT
THAT
FLIES
FROM
MEMPHIS
TO
TACOMA
.
"data/8k3012ss.lab"
I
WOULD
LIKE
TO
BOOK
A
...
Creating MFCCs
Now we are ready to create cepstra. What are
cepstra
(sg. cepstrum)?
Well, they're like spectra, but not. A cepstrum is the result of applying a Fourier
transform to a (log) spectrum. In this case, we'll create a certain variety
of cepstral coefficients, the Mel
Frequency Cepstral Coefficients (MFCCs), which are the
standard in speech research.
Unlike plain cepstra, MFCCs use the Mel scale, which is based on
perceptual results in human hearing. The physiological and psychological
structure of human hearing has the effect of increasing the relative
perception of intensity for some frequencies, and decreasing it for
others, and MFCCs take this into account.
To create the cepstra, which are the raw data used to form HMMs, we use
the HTK tool HCopy. Though the name might strike you as weird,
it makes "copies" of data, just in a different format (in this case, MFCCs).
HCopy takes a single configuration file, named copy.cfg,
and provided in the configuration/script tarball.
The parameters are listed below.
SOURCEFORMAT = WAVE
SOURCERATE = 625.0 # 16 kHz sampling rate, as a period in 100ns units
TARGETKIND = MFCC_0 # MFCCs are the best choice, C_0 as an energy coefficient
TARGETRATE = 100000.0 # 10 ms targets
SAVECOMPRESSED = T # keep compressed output
SAVEWITHCRC = T # use checksums
WINDOWSIZE = 250000.0 # 25 ms window
USEHAMMING = T # use a hamming window
PREEMCOEF = 0.97 # first order preemphasis
NUMCHANS = 20 # 20 filterbank channels
CEPLIFTER = 22 # cepstral liftering coefficient
NUMCEPS = 12 # make 12 MFCC cepstral coefficients
ENORMALISE = T # normalize intensity of data
NATURALREADORDER = T # Zhi-Jie Yan, p.c., should solve problems with byte order
data/8k3012ss.wav data/8k3012ss.mfc
data/8k3013ss.wav data/8k3013ss.mfc
data/8k3014ss.wav data/8k3014ss.mfc
data/8k3015ss.wav data/8k3015ss.mfc
data/8k3021ss.wav data/8k3021ss.mfc
data/8k3022ss.wav data/8k3022ss.mfc
data/8k3023ss.wav data/8k3023ss.mfc
data/8k3024ss.wav data/8k3024ss.mfc
data/8k3025ss.wav data/8k3025ss.mfc
Initializing the model
The next step is to initialize the monophone HMMs. These are called
"flat-start" HMMs since they just take all states to
have the same mean and variance. We'll follow the suggestion
given by the HTK Book and use a 3-state left-right
model with a thirteen-value static vector plus the delta (change)
coefficients plus the acceleration coefficients. That means
13 + 13 + 13 = 39 vector values. HTK doesn't have a tool for making the flat start
vector, so I've just included the proto model as proto (HTK
prevents us from giving this a file extension, but it shouldn't be a problem).
Consult the HTK Book if you want to use a different model. This model
takes the mean to be 0, and the variance 1.
Re-estimation
[kgorman@harris atis]$ HERest -C comp.cfg -I phone0.mlf -t 250.0 150.0 1000.0 -S train.scp -H hmm02/macros -H hmm02/hmmdefs -M hmm03/ phone0.list
Training and segmenting
Fixing the silence models
[kgorman@harris atis]$ ./phone2label phone1.mlf
Training
[kgorman@harris atis]$ HERest -C comp.cfg -I phone1.mlf -t 250.0 150.0 1000.0 -S train.scp -H hmm06/macros -H hmm06/hmmdefs -M hmm07/ phone1.list
Re-aligning data
More training
[kgorman@harris atis]$ HERest -C comp.cfg -I phone2.mlf -t 250.0 150.0 1000.0 -S train.scp -H hmm08/macros -H hmm08/hmmdefs -M hmm09/ phone1.list
Higher level models
Going multi-Gaussian
Going speaker dependent
Segmenting
and so on.
However, there is a major problem with this approach: data sparsity.
Assuming there are approximately 40
phonemes in English (an assumption which is highly dependent on
dialect), there are 40 monophones, but 40^2 = 1600 possible unique
biphones, and 40^3 = 64000 possible triphones. Luckily, quite a few
of these combinations never occur (Pierrehumbert 1994),
but even with large corpora (and the corpus I'm working with, ATIS,
certainly isn't large) this creates a major data sparsity problem.
If you choose to do this, though, consult the
HTK Book.
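To get a feel for how sparse your own data is, you can count how many distinct three-phone sequences actually turn up and compare that to the number possible. The Python sketch below does this with plain phone n-grams; it ignores word boundaries and HTK's own triphone naming, and the function name context_coverage is hypothetical.

```python
def context_coverage(sequences, inventory_size, order=3):
    """Return (distinct_seen, possible) counts of phone n-grams."""
    seen = set()
    for phones in sequences:
        # Slide a window of `order` phones across each sequence.
        for i in range(len(phones) - order + 1):
            seen.add(tuple(phones[i:i + order]))
    return len(seen), inventory_size ** order


if __name__ == "__main__":
    # Two made-up phone sequences, just to show the call.
    demo = [["F", "AY", "N", "D"], ["F", "AY", "N"]]
    print(context_coverage(demo, inventory_size=40))
```

A small first number relative to the second is the sparsity problem in a nutshell: most contexts have no training examples at all.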
#!MLF!#
"mfc/8k3011ss.lab"
0 8900000 sil sil
8900000 9500000 F FIND
9500000 10300000 AY
10300000 10800000 N
10800000 11100000 D
11100000 11100000 sp
11100000 11500000 M ME
11500000 12600000 IY
12600000 12600000 sp
12600000 13100000 AH A
13100000 13100000 sp
13100000 14400000 F FLIGHT
14400000 15200000 L
0.00 0.89 [silence]
0.89 1.11 FIND
1.11 1.26 ME
1.26 1.31 A
1.31 1.71 FLIGHT
1.71 1.84 [silence]
1.84 2.03 THAT
2.03 2.05 [silence]
2.05 2.43 FLIES
2.43 2.60 FROM
2.60 2.63 [silence]
2.63 3.07 MEMPHIS
3.07 3.17 TO
3.17 3.80 TACOMA
3.80 4.03 [silence]

Acknowledgements
References