Workshop 2000

Pitch Tracking and Epoch Detection

Paul Bamberg
Dragon Systems, Inc.

References: David Talkin, "A Robust Algorithm for Pitch Tracking (RAPT)," Ch. 14 in Kleijn and Paliwal, Speech Coding and Synthesis,
which says
"The documentation included below is intended to provide sufficient detail to reproduce exactly the results obtained using the get_f0 program in the waves+ software package for Entropic Research Laboratory, Inc."

Talkin and Rowley, "Pitch-synchronous Analysis and Synthesis for TTS Systems"

Pitch Tracking has the goal of determining, as a function of time within a spoken utterance,
1) whether the speech is voiced or unvoiced.
2) if it is voiced, what the fundamental frequency (f0) is.

Reasonable people can disagree about the correct answers to these questions. A reasonable criterion for a "good" answer is that you hear no "glitches" due to pitch or voicing errors when you use it as a basis for resynthesis.

As a practical matter, choose a frame period like 5 ms (200 frames per second) and make a voicing decision for each frame and an f0 assignment for each voiced frame.

Local Voicing and f0 Assignment
Voiced speech should correlate very well with itself at a time lag of one fundamental period or an integer multiple thereof.
Unvoiced speech should not correlate well with itself for any time lag.

So for each frame we take the speech in a window of about 7.5 ms and calculate its normalized cross-correlation with the speech signal in windows at various "lags" in the future (in the past would work equally well). Lags should range from less than 2 ms (for f0 = 500 Hz) to more than 20 ms (for f0 = 50 Hz).

If the speech in one window is a (scaled) replica of the other, the correlation will be 1.0. If two periods are very similar, a correlation of 0.98 is not unreasonable.
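The correlation described above can be sketched as follows. This is a minimal illustration, not the get_f0 implementation: the function name nccf, the 8 kHz sampling rate, and the synthetic test signal are all assumptions made for the example.

```python
import numpy as np

def nccf(signal, start, window_len, min_lag, max_lag):
    """Normalized cross-correlation between the window beginning at `start`
    and equally long windows beginning `lag` samples later."""
    ref = signal[start:start + window_len]
    ref_energy = np.dot(ref, ref)
    lags = np.arange(min_lag, max_lag + 1)
    corr = np.zeros(len(lags))
    for i, lag in enumerate(lags):
        shifted = signal[start + lag:start + lag + window_len]
        if len(shifted) < window_len:
            break  # ran off the end of the signal; remaining lags stay 0
        denom = np.sqrt(ref_energy * np.dot(shifted, shifted))
        if denom > 0.0:
            corr[i] = np.dot(ref, shifted) / denom
    return lags, corr

# Hypothetical test signal: 100 Hz fundamental plus two harmonics at 8 kHz.
fs = 8000
t = np.arange(2000) / fs
x = (np.sin(2 * np.pi * 100 * t)
     + 0.5 * np.sin(2 * np.pi * 200 * t)
     + 0.25 * np.sin(2 * np.pi * 300 * t))

window_len = round(0.0075 * fs)           # 7.5 ms window
lags, corr = nccf(x, 0, window_len,
                  round(0.002 * fs),      # 2 ms  -> f0 = 500 Hz
                  round(0.020 * fs))      # 20 ms -> f0 = 50 Hz
corr_at_period = corr[lags == 80][0]      # 80 samples = one 10 ms period
```

Because the test signal is exactly periodic, the correlation at a lag of one period (80 samples) comes out very close to 1.0, and so does the correlation at two periods. That ambiguity is what the lag penalty discussed next is meant to resolve.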

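Adopt the convention that "low scores are good."

Then the maximum correlation over all lags can serve as a "score" for the "unvoiced" hypothesis: if the speech correlates strongly with itself at some lag, the unvoiced hypothesis gets a high (bad) score.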

For the voiced hypothesis, there will be candidates at different lags. One is correct, but we will also see high correlations at integer multiples of the correct lag and at lags equal to the inverse of a formant frequency, usually the first formant, f1. Display this as a "correlogram."

Use (1 - correlation) as the score, but add in a penalty that increases linearly with lag time. This often gives the best score to the lag corresponding to the true f0.
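As a rough sketch of this local scoring, assuming the correlation values are already available as an array indexed by lag: the name local_scores, the penalty weight of 0.3, and the cosine-shaped correlation curve are illustrative assumptions, not values taken from RAPT.

```python
import numpy as np

def local_scores(lags, corr, fs, lag_weight=0.3, max_lag_s=0.020):
    """Return (unvoiced_score, voiced_candidates); low scores are good.
    Each voiced candidate is a (score, lag_in_samples) pair."""
    unvoiced_score = float(np.max(corr))
    candidates = []
    for i in range(1, len(corr) - 1):
        if corr[i] > corr[i - 1] and corr[i] >= corr[i + 1]:  # local peak
            lag_s = lags[i] / fs
            # (1 - correlation) plus a penalty growing linearly with lag
            score = (1.0 - corr[i]) + lag_weight * lag_s / max_lag_s
            candidates.append((float(score), int(lags[i])))
    candidates.sort()
    return unvoiced_score, candidates

# Hypothetical correlation curve with equal peaks (0.97) at lags 80 and 160,
# i.e. at the true period and at twice the true period.
fs = 8000
lags = np.arange(16, 171)
corr = 0.97 * np.cos(2 * np.pi * lags / 80.0)

unvoiced_score, candidates = local_scores(lags, corr, fs)
best_score, best_lag = candidates[0]
```

Both peaks have the same correlation, so (1 - correlation) alone cannot separate them; the linear lag penalty gives the better score to the shorter lag, 80 samples.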

To save time, do this analysis using a decimated waveform, then repeat the analysis on the full waveform near the best peaks.

Connecting up the local estimates:

For each frame, we now have
1) a score for the unvoiced hypothesis
2) scores for one or more voiced hypotheses

But just choosing the best-scoring hypothesis for each frame is likely to give a pitch track that is manifestly wrong, violating non-local features of f0 like

1) Transitions between voiced and unvoiced speech are infrequent and tend to occur near phoneme boundaries. If frames n-1 and n+1 are both clearly voiced, so is frame n!

We can implement this idea by adding to the total score a penalty for each voiced-unvoiced transition, a sum of three terms:

a) a constant

b) a term that is positive (bad) when the spectrum is very steady and negative (good) when the spectrum is changing rapidly. Implementation (Itakura): see how well the LPC coefficients for one window predict the speech in a window 20 ms away.

c) a term that favors unvoiced-to-voiced transitions when RMS power is increasing and that favors voiced-to-unvoiced transitions when RMS power is decreasing.

2) f0 varies continuously and fairly slowly as a function of time, with some exceptions:

a) It might change by almost exactly a factor of 2.

b) Near glottal stops and near certain consonants it can change rapidly.

c) In "vocal fry" at the end of an utterance f0 can be erratic.

Still, it is reasonable to add into the total score a term proportional to the time integral of |d(log f0)/dt|. This will have to be approximated by differences, but the total score should not depend on the frame rate.
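One way to see why a difference approximation can be independent of frame rate: the integral of |d(log f0)/dt| over a frame is approximated by |log f0(next) - log f0(previous)|, with no explicit factor of the frame period. The sketch below checks this on a hypothetical glide from 100 Hz to 150 Hz; the function names and the weight are assumptions for illustration.

```python
import math

def f0_change_cost(f0_prev, f0_next, weight=1.0):
    """Approximates weight * integral of |d(log f0)/dt| over one frame."""
    return weight * abs(math.log(f0_next / f0_prev))

def track_cost(f0_track, weight=1.0):
    """Sum of the frame-to-frame costs along a fully voiced f0 track."""
    return sum(f0_change_cost(a, b, weight)
               for a, b in zip(f0_track, f0_track[1:]))

def glide(n_frames, f_start=100.0, f_end=150.0):
    """Hypothetical smooth glide sampled at n_frames equally spaced instants."""
    ratio = f_end / f_start
    return [f_start * ratio ** (k / (n_frames - 1)) for k in range(n_frames)]

cost_5ms = track_cost(glide(41))    # 200 ms sampled every 5 ms
cost_10ms = track_cost(glide(21))   # the same 200 ms sampled every 10 ms
```

For a monotonic glide the per-frame terms telescope, so both totals equal log(150/100) up to rounding, whatever the frame period.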

Put this all together, and it is a straightforward exercise in dynamic programming to find the sequence of hypotheses (one for each frame) that minimizes the total score.
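A minimal sketch of such a dynamic-programming search follows. It is deliberately simplified: the voiced-unvoiced transition penalty is only the constant term (the spectral-change and RMS terms are omitted), and the names, weights, and example scores are assumptions for illustration rather than RAPT's values.

```python
import math

UNVOICED = None  # a hypothesis is either UNVOICED or an f0 value in Hz

def transition_cost(prev, cur, vuv_penalty=0.5, f0_weight=1.0):
    if prev is UNVOICED and cur is UNVOICED:
        return 0.0
    if prev is UNVOICED or cur is UNVOICED:
        return vuv_penalty                       # constant term only
    return f0_weight * abs(math.log(cur / prev))  # f0 continuity term

def best_path(frames):
    """frames[t] is a list of (hypothesis, local_score) pairs.
    Returns the hypothesis chosen for each frame (lowest total score)."""
    costs = [score for _, score in frames[0]]
    back = []
    for t in range(1, len(frames)):
        new_costs, pointers = [], []
        for cur, score in frames[t]:
            options = [costs[j] + transition_cost(prev, cur)
                       for j, (prev, _) in enumerate(frames[t - 1])]
            j_best = min(range(len(options)), key=options.__getitem__)
            new_costs.append(options[j_best] + score)
            pointers.append(j_best)
        back.append(pointers)
        costs = new_costs
    # Trace back from the cheapest final hypothesis.
    j = min(range(len(costs)), key=costs.__getitem__)
    path = [j]
    for pointers in reversed(back):
        j = pointers[j]
        path.append(j)
    path.reverse()
    return [frames[t][j][0] for t, j in enumerate(path)]

# Hypothetical scores: in the middle frame the octave error (200 Hz)
# has a slightly better local score than the true 100 Hz.
frames = [
    [(UNVOICED, 0.95), (100.0, 0.10)],
    [(UNVOICED, 0.90), (100.0, 0.20), (200.0, 0.15)],
    [(UNVOICED, 0.95), (101.0, 0.10)],
]
track = best_path(frames)
```

Picking the best local score in each frame would give 100, 200, 101 Hz. The continuity term makes the jump to 200 Hz and back expensive, so the search keeps the middle frame at 100 Hz.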

Epoch detection:
Now we know the time interval between instants of glottal closure (the inverse of the fundamental frequency). There remains the task of identifying the precise time at which each glottal closure occurs.

Basic strategy:
Do linear prediction of the waveform (LPC analysis) using about 14 coefficients: predict each sample as a linear combination of the preceding 14 samples, minimizing the sum of squares of the errors in this prediction (the "residual").
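The prediction step can be sketched with an ordinary least-squares fit, as below. This shows only the fit and its residual, not the further processing needed to obtain a glottal waveform. The function name lpc_residual and the synthetic signal (a two-pole resonator excited by one impulse every 80 samples plus a little noise) are assumptions for the example.

```python
import numpy as np

def lpc_residual(x, order=14):
    """Fit x[n] ~ sum_k coeffs[k-1] * x[n-k], k = 1..order, by least squares.
    Returns (coeffs, residual); residual[i] is the error at sample order + i."""
    n = len(x)
    # Column k-1 holds x[n-k] for every predicted sample n = order .. len(x)-1.
    past = np.column_stack([x[order - k:n - k] for k in range(1, order + 1)])
    target = x[order:]
    coeffs, *_ = np.linalg.lstsq(past, target, rcond=None)
    residual = target - past @ coeffs
    return coeffs, residual

# Hypothetical "voiced" signal: impulses every 80 samples driving a
# stable two-pole filter, with a small amount of added noise.
rng = np.random.default_rng(0)
n = 800
excitation = 0.01 * rng.standard_normal(n)
excitation[::80] += 1.0
x = np.zeros(n)
for i in range(n):
    x[i] = excitation[i]
    if i >= 1:
        x[i] += 1.3 * x[i - 1]
    if i >= 2:
        x[i] -= 0.6 * x[i - 2]

order = 14
coeffs, residual = lpc_residual(x, order)
peak_sample = int(np.argmax(np.abs(residual))) + order
```

The predictor cannot anticipate the excitation impulses, so the largest residual values fall on the samples where an impulse was injected. This is the sense in which the residual marks candidate epochs.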

From such analysis (if the details are handled right) we can construct a "glottal waveform" (representing the derivative of the airflow through the glottis) that has a sharp cusp whenever the glottis closes.

Now every peak (of the correct sign) is a candidate for an epoch. The higher and sharper the peak, the better its local score. We have constraints like this:

1) Epochs should occur in voiced regions and not in unvoiced regions.

2) The interval between epochs should be close to the inverse of f0.

3) Given three successive epochs, the middle one should be roughly equidistant from the other two.

Again it is an exercise in dynamic programming to find, from among all the epoch candidates, a subset of "true" epochs that minimizes an appropriately chosen score. With skill these epochs can be localized to within about 10 microseconds (about one-tenth of the sample spacing at 11 kHz).
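A simplified sketch of the subset selection is given below. It scores only peak strength and the match between epoch spacing and the pitch period (constraint 2); the voicing and three-epoch constraints are left out, the first and last candidates are assumed to be true epochs, and a single constant period is assumed. All names, weights, and candidate values are hypothetical.

```python
import math

def select_epochs(times, strengths, period, interval_weight=2.0):
    """times: sorted candidate instants (s); strengths: peak quality in 0..1.
    Returns the subset of times with the lowest total score."""
    n = len(times)
    best = [math.inf] * n   # best[j]: lowest score of a path ending at j
    prev = [-1] * n
    best[0] = 1.0 - strengths[0]
    for j in range(1, n):
        for i in range(j):
            if best[i] == math.inf:
                continue
            interval = times[j] - times[i]
            cost = (best[i]
                    + interval_weight * abs(math.log(interval / period))
                    + (1.0 - strengths[j]))
            if cost < best[j]:
                best[j] = cost
                prev[j] = i
    # Trace back from the last candidate.
    path = []
    j = n - 1
    while j != -1:
        path.append(times[j])
        j = prev[j]
    return path[::-1]

# Hypothetical candidates for f0 = 100 Hz (period 10 ms); the peaks at
# 4 ms and 26 ms are weak, spurious candidates.
times = [0.000, 0.004, 0.010, 0.0201, 0.026, 0.030, 0.040]
strengths = [0.9, 0.3, 0.9, 0.9, 0.4, 0.9, 0.9]
epochs = select_epochs(times, strengths, period=0.010)
```

Including a spurious candidate creates two intervals far from one period, and skipping a true epoch creates an interval of about two periods; both cost more than the small local score saved, so the search keeps the five regularly spaced candidates.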
