Integrating Signal Analysis Data with Time Estimates

Executive Summary: To convert an integer FrameNumber to a time value Time, the general rule is

Time = (FrameNumber-1)*FrameStep + Window/2

(where FrameStep and Window are expressed in seconds, and FrameNumbers are counted starting from 1).

In the other direction, to convert a time value Time to an analysis-frame number FrameNumber,

FrameNumber = (Time - Window/2)/FrameStep + 1

Note that whatever their source, time values in phonetics research are generally approximate, despite their apparent accuracy, and rarely have much meaning beyond a resolution of about 10 milliseconds.
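
For example, here is a minimal sketch of both conversions as R functions (the names frame_to_time and time_to_frame are ours, purely for illustration):

# time in seconds of a 1-based analysis-frame number
frame_to_time = function(frame, frame_step, window) {
  (frame - 1)*frame_step + window/2
}

# (fractional) 1-based frame number at a given time; round() it for an index
time_to_frame = function(time, frame_step, window) {
  (time - window/2)/frame_step + 1
}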

The details

There are two general sorts of issues that may come up when you want to connect the results of signal analysis -- f0, amplitude, formants, spectral parameters -- with time values.

One set of problems has to do with how to read and write file formats. There are arbitrary levels of potential complexity here, so for the moment we'll provide cookbook solutions on a case-by-case basis.

The other set of problems has to do with how to relate time values -- which are generally decimal fractions representing seconds from the start of an audio file, like 17.10865 -- to analysis-frame numbers -- which are generally integers like 3426.

In fact both kinds of numbers are quantized, but in different ways.

Time values: A time value like 17.10865 seems arbitrarily accurate, but in fact this is an illusion. If it comes from human measurements made interactively in a program like Praat, there are three possible sources of quantization:

One is the digital nature of the underlying audio file. A sampling frequency of FS samples per second implies a time quantum of 1/FS seconds, so that 22050 Hz sampling corresponds to a time granularity of

1/22050 ≅ 4.535147e-05 seconds,

or about .045 milliseconds.

But in the second place, the program can't distinguish interactive time-identifications any more finely than one screen pixel in a given display of the time region. So if Praat is showing you 1.428964 seconds of audio in a window that's devoting 970 pixels to the waveform display, each pixel represents

1.428964/970 ≅ 0.001473159 seconds,

or about 1.5 milliseconds.

And in the third place, if you are making time-identifications on the basis of derived quantities within an interactive program -- a spectrogram, or a pitch or formant display -- then the time quantum is limited by the nature of that analysis. Praat's standard time resolution for spectrogram displays is 1,000 spectra per second, which in principle might offer accuracy of one millisecond, though it is further limited by the screen resolution; Praat's standard time step for pitch analysis is 0.75 divided by the pitch floor, so that if the minimum pitch is 60 Hz, the time step will be

0.75/60 = .0125

or 12.5 milliseconds.
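
Putting numbers on all three of these quanta at once, in R:

1/22050         # audio sample quantum at 22050 Hz: 4.535147e-05 seconds
1.428964/970    # one screen pixel's worth of time: 0.001473159 seconds
0.75/60         # Praat's default pitch time step at a 60 Hz floor: 0.0125 seconds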

If the time value comes from a forced-alignment program, a speech activity detection program, or the like, then it will have been derived from the analysis time step of that program, which is generally on the order of 10 milliseconds. For how those steps get translated to time values, see below. (WARNING: some programs do these translations wrong, in ways that can accumulate to large differences over the course of a long audio file, so be careful...)

Analysis frame numbers: If you produce f0 estimates by running get_f0a on an audio file, you'll get a sequence of lines like this:

  0.000000 0.000000 284.895020 0.597321
183.941391 1.000000 442.215668 0.823467
174.927322 1.000000 585.777344 0.927709
170.496811 1.000000 726.572571 0.986007
169.637421 1.000000 879.125671 0.989023

where the four columns are the f0 estimate, voicing status, RMS amplitude, and the serial cross-correlation at the lag corresponding to the f0 estimate.

There is one of these lines for every analysis frame. In a computer program like R or Matlab where array indices start from 1, the sample above would correspond to frame numbers 1 to 5; in a program like Python or C, where array indices start from 0, the frame numbers would be 0 to 4. (0-based indexing is logically preferable, but in the rest of this discussion, we'll assume 1-based indexing.)
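
In R, for instance, reading such a file might look like this (the file name and the column labels are ours, just for illustration):

X = read.table("some.f0")                      # one row per analysis frame
colnames(X) = c("f0", "voiced", "rms", "corr") # the four columns above
X$f0[1]                                        # f0 estimate for frame 1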

What is an "analysis frame", and why does it matter?

Well, in order to estimate f0 or formants or spectral amplitudes or whatever, a program needs to analyze sound over some stretch of time. An f0 of 200 Hz, for example, means that the signal is approximately periodic at an interval of 1/200 of a second, which is 5 milliseconds. If the audio signal has a sampling frequency of FS=10000, this corresponds to 10000/200=50 samples. Therefore the analysis program will have based its estimate of the frequency on comparing a vector of D samples starting at sample N with another vector of D samples starting at sample N+50.

What "time" does this correspond to? Or rather, what time offset relative to the start of the audio file?

In this case, the f0 estimate is the result of comparing

N:(N+D-1) samples

to

(N+50):(N+50+D-1) samples.

If we suppose that N -- the starting point of the calculation -- is the first sample of the file, and that we want N=1 to correspond to time 0, and that we are comparing stretches of D = 80 samples = 8 milliseconds, then given the other values we've assigned, this translates to the discovery that

0 to 8 milliseconds

is approximately the same as

5 to 13 milliseconds

What time point should we assign to this discovery?

One answer would be: This is a claim about the time region from 0 to 13 milliseconds, so we should assign a time point in the middle of that region, namely 0.0065 seconds.

Another answer would be: This is a claim about the time region from 0 to 8 milliseconds, so we should assign a time point in the middle of that region, namely 0.004 seconds.

The usual choice is the second one. This has the advantage that the time-value of a frame number doesn't depend on the content of the analysis -- and it avoids the weird outcome of a frame that starts later getting an earlier corresponding time value, which could otherwise happen in the case of upward octave jumps.
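
The arithmetic for those two candidate time points, in R:

# midpoint of the full 0-to-13-ms region (window plus detected period)
(0 + 0.013)/2   # 0.0065 seconds
# midpoint of the 0-to-8-ms comparison window alone -- the usual choice
(0 + 0.008)/2   # 0.004 seconds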

So where does the whole "Frame Number" business come in?

We've just discussed an f0 analysis starting at speech sample 1 (= 0 seconds offset from the start of the file). What happens next?

Well, our program does the same sort of analysis again, starting FrameStep seconds later. We want to do this so as to have an adequate time-density of estimates, given the characteristic time scale of the phenomenon. Since we're generally dealing with frequencies in the general range of 100-400 Hz, corresponding to .010 to .0025 seconds, it makes sense to sample about 100 or 200 times per second, corresponding to nominal FrameStep values of 0.01 or 0.005 seconds.

So if our FrameStep is 5 milliseconds, and our comparison span D is still 8 milliseconds, then the time value corresponding to the second analysis frame would be

0.005+0.008/2 = 0.009 seconds

And the time corresponding to the third analysis frame (counting from 1) would be

(3-1)*0.005 + 0.008/2 = 0.014 seconds
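
To check those first few frame times in R, with the values we've been using (a FrameStep of 0.005 seconds, and the comparison span D of 0.008 seconds serving as the Window):

FrameStep = 0.005
Window = 0.008                        # the comparison span D, in seconds
((1:3) - 1)*FrameStep + Window/2      # 0.004 0.009 0.014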

IMPORTANT:

Some programs internally define the FrameStep in (integer) audio samples. This doesn't matter if it all works out to an integer number of samples -- thus a FrameStep of 0.005 seconds at a sampling frequency of 16000 Hz is exactly .005*16000 = 80 samples. But at a sampling frequency of 44.1 kHz (the weird value chosen by a brain-damaged standards committee 50 years ago, based on obscure properties of U-matic video tapes -- don't ask...), a FrameStep of 0.005 seconds is 0.005*44100 = 220.5 samples.

This wouldn't be a problem, except for the programs that operate internally on the assumption that the FrameStep is an integer number of samples. And even then, it's not an issue with short files -- over the course of three seconds, the difference is just 3*200*0.5/44100 = 0.006802721, or about 6.8 milliseconds, which you probably wouldn't notice.

But over the course of 40 minutes, the difference is 40*60*200*0.5/44100 = 5.442177, or almost 5 and a half seconds -- which you certainly would notice.
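
The same drift arithmetic in R:

# half a sample of error per frame, at 200 frames per second
3 * 200 * 0.5 / 44100        # 0.006802721 seconds after 3 seconds
40 * 60 * 200 * 0.5 / 44100  # 5.442177 seconds after 40 minutes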

The get_f0a program warns you about this. If you ask it to analyze an audio file sampled at 22050 Hz with a FrameStep of 0.005 seconds, it will tell you what it's doing:

$ soxi test1.wav
Input File     : 'test1.wav'
Channels       : 1
Sample Rate    : 22050
Precision      : 16-bit
Duration       : 00:00:05.24 = 115606 samples ~ 393.218 CDDA sectors
File Size      : 231k
Bit Rate       : 353k
Sample Encoding: 16-bit Signed Integer PCM
$ get_f0a -i test1.wav -f 0.005 >test1.f0
Frame step set to 0.004989 to exactly match signal sample rate.

What's going on? Well, a FrameStep value of 0.005 seconds at a sampling rate of 22050 Hz is

0.005*22050 = 110.25 samples

The program is rounding that down to 110 samples -- and converted back to seconds, that's about

110/22050 = 0.004988662

Since 110 samples is exactly the step that the program uses internally in advancing its analysis frames, you should set FrameStep to 110/22050 when converting back and forth between frame numbers and times.

And by the way, the default Window value in get_f0a is 0.01 seconds.
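
If you need the effective FrameStep for some other combination of nominal step and sample rate, here's a one-line R helper, assuming (as get_f0a's message indicates) that the program truncates the nominal step to a whole number of samples:

# effective frame step: nominal step truncated to an integer sample count
effective_step = function(nominal, fs) floor(nominal * fs) / fs
effective_step(0.005, 22050)   # 0.004988662 = 110/22050
effective_step(0.005, 16000)   # exactly 0.005 = 80 samples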

Working in R

As an example, in harris.sas.upenn.edu:/plab/FreshAir/LenaDunham, you'll find .wav, .trs, and .f0 files for this interview. The .wav file has a sample rate of 22050 Hz and is 59495040 samples long:

$ soxi LenaDunham.wav
Input File     : 'LenaDunham.wav'
Channels       : 1
Sample Rate    : 22050
Precision      : 16-bit
Duration       : 00:44:58.19 = 59495040 samples ~ 202364 CDDA sectors
File Size      : 119M
Bit Rate       : 353k
Sample Encoding: 16-bit Signed Integer PCM

59495040 samples at 22050 samples per second is

59495040/22050 = 2698.188 seconds

The .f0 file contains 540858 frames, and was calculated with a nominal FrameStep of 0.005 seconds:

$ wc LenaDunham.f0
 540858  2163432 21127339 LenaDunham.f0

By the formula given above, the time corresponding to the last analysis frame should be

(540858-1)*110/22050 + .01/2 = 2698.158 seconds

This is 30 milliseconds shorter than the full file length. The difference is explained by the fact that the program abandoned its analysis when the end of the comparison window required by the longest-period analysis within its range of choices reached the end of the file -- in this case a period of 20 milliseconds, followed by a comparison span of 10 milliseconds. So far, so good.
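
Checking that arithmetic in R:

FrameStep = 110/22050; Window = 0.01
last_time = (540858 - 1)*FrameStep + Window/2
last_time                      # 2698.158
59495040/22050 - last_time     # about 0.030 -- the missing 30 milliseconds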

If we use e.g.

$ untrs1.pl LenaDunham.trs | untrs2.pl

we'll find this transcript towards the end of the interview:

2552.41 2557.33 I don't think I used to know the difference between someone being eccentric and someone being
2557.33 2561.76 uh you know, a destructive nightmare. I think I thought that the two went hand-in-hand.
2561.76 2565.97 And now my life is full of people who are special and unusual and strange
2565.97 2570.73 but don't scare me or hurt me or put me in dangerous situations.
2570.73 2573.67 But I don't think I used to be able to make that distinction.

If we explore the file in Praat, we can display the pitch track for the phrase "I don't think I used to know the difference between someone being eccentric", which falls roughly from 2552.5 to 2555.8 seconds:

[Praat pitch-track display of the phrase]

Now let's read the whole f0 file into R, and display the last syllable of "eccentric", which is approximately 2555.64 to 2555.88 seconds:

> X = read.table("LenaDunham.f0")          # one row per analysis frame
> Window = 0.01; FrameStep = 110/22050     # get_f0a's actual Window and step
> time1 = 2555.64; time2 = 2555.88
> frame1 = round((time1 - Window/2)/FrameStep + 1)
> frame2 = round((time2 - Window/2)/FrameStep + 1)
> f0sample = X[frame1:frame2, 1]           # column 1 is the f0 estimate
> f0sample[f0sample == 0] = NA             # 0 means unvoiced, so don't plot it
> times = ((frame1:frame2) - 1)*FrameStep + Window/2
> plot(times, f0sample)


The result: [plot of f0sample against times]