Simple Chinese Forced Alignment

You need:

  1. An audio file. Single-channel, ideally sampled at either 8000 or 16000 Hz.
  2. A transcript. Simplified characters, UTF-8, with spaces between "words".

You can use sox to change sample rates, combine channels, etc. There are various Chinese word segmenters available that work to varying degrees -- the Stanford Word Segmenter may be useful.
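Before running the aligner, it can help to confirm that both inputs meet the requirements above. The following is a minimal sketch using only the Python standard library; the function names check_wav and check_transcript are made up for this illustration and are not part of the aligner.

```python
import wave


def check_wav(path):
    """Return a list of problems found in the audio file (empty if none)."""
    problems = []
    with wave.open(path, "rb") as w:
        channels = w.getnchannels()
        rate = w.getframerate()
    if channels != 1:
        problems.append("expected 1 channel, found %d" % channels)
    if rate not in (8000, 16000):
        problems.append("sample rate %d Hz is not 8000 or 16000" % rate)
    return problems


def check_transcript(path):
    """Return a list of problems found in the transcript (empty if none)."""
    with open(path, "rb") as f:
        raw = f.read()
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError as err:
        return ["not valid UTF-8: %s" % err]
    # Words are expected to be separated by whitespace.
    if not text.split():
        return ["transcript contains no words"]
    return []
```

For example, check_wav("XUL001030.wav") and check_transcript("XUL001030.txt") should each return an empty list if the files are suitable. The sketch does not check that the characters are simplified or that the segmentation is sensible.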

The command-line call:

Calign.py [options] wavfile transcript output

where the options may include

-r sampling_rate -- override which sampling-rate model to use, either 8000 or 16000
-a user_supplied_dictionary -- encoded in UTF-8; the dictionary will be combined with the dictionary in the model
-d user_supplied_dictionary -- encoded in UTF-8; the dictionary will be used alone, NOT combined with the dictionary in the model
-p punctuations -- encoded in UTF-8; punctuation marks and other symbols listed in this file will be deleted in forced alignment; the default is to use "puncs" in the model

For example, in /plab/L521 we have XUL001030.wav and XUL001030.txt, extracted from the 1997 Mandarin Broadcast News Speech and 1997 Mandarin Broadcast News Transcripts. The audio is 5.18 seconds long, and the transcript is

英国 的 英格兰 和 威尔士 近来 遭到 了 自 一 七 四 九 年 以来 最 严重 的 旱灾

If we execute

Calign.py XUL001030.wav XUL001030.txt XUL001030.align

we get the output XUL001030.align, which starts like this:

#!MLF!#
"/tmp/myl_18299.rec"
0 6600000 sp 2252.594238 sp
6600000 7200000 y 40.222908 英国
7200000 7900000 i 85.038506
7900000 8800000 N 116.978119
8800000 9200000 g 54.454571
9200000 9600000 w 54.746742
9600000 10300000 > 133.321579
10300000 10300000 sp -0.076879 sp
10300000 10800000 d 49.715294 的
10800000 11300000 & 36.615112
11300000 11300000 sp -0.076879 sp
11300000 11600000 y 16.688311 英格兰
11600000 12400000 i 126.154274
12400000 12800000 N 74.454620
12800000 13100000 g 45.576759
13100000 13500000 & 26.238358
13500000 13900000 l 59.842278
13900000 14900000 @ 266.507843
14900000 15200000 n 0.055743
15200000 15200000 sp -0.076879 sp
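The output above is an HTK-style master label file, with start and end times given in units of 100 ns. The following is a rough sketch of how word-level times could be recovered from it in Python. It assumes, judging from the sample only, that each line has the form "start end phone score [word]" and that the word appears only on the first phone of each word; parse_alignment is a name made up for this illustration.

```python
HTK_UNITS_PER_SECOND = 10000000  # HTK label times are in units of 100 ns


def parse_alignment(lines):
    """Group phone lines into (word, start_sec, end_sec, phones) tuples."""
    words = []
    for line in lines:
        fields = line.split()
        # Skip the "#!MLF!#" header, the quoted label-file name, blank
        # lines, and anything else that does not begin with a start time.
        if len(fields) < 4 or not fields[0].isdigit():
            continue
        start, end, phone = int(fields[0]), int(fields[1]), fields[2]
        if len(fields) >= 5:
            # Assumption: a fifth field marks the first phone of a new word.
            words.append([fields[4], start, end, [phone]])
        elif words:
            # Otherwise the phone continues the current word.
            words[-1][2] = end
            words[-1][3].append(phone)
    return [
        (word, start / HTK_UNITS_PER_SECOND, end / HTK_UNITS_PER_SECOND, phones)
        for word, start, end, phones in words
    ]


if __name__ == "__main__":
    with open("XUL001030.align", encoding="utf-8") as f:
        for word, start, end, phones in parse_alignment(f):
            print("%.2f\t%.2f\t%s\t%s" % (start, end, word, " ".join(phones)))
```

Applied to the lines shown above, this places 英国 between 0.66 and 1.03 seconds. The short-pause entries (sp) are returned as words too, so filter them out if only lexical words are wanted.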

After running the aligner, you should check the file MissingWords, which lists any words that are present in the transcript but not in the pronouncing dictionary. You can create your own supplementary dictionary, as a plain text file in a format like this one, and supply it to the aligner via the -a flag.

We can turn the aligner output into a Praat TextGrid in the usual way:

align2textgrid XUL001030.align >XUL001030.TextGrid

[Forced aligner code by Jiahong Yuan]