Underlying model of speech production and using it to rescore
HBR 14/07/1998


The Model of Speech Generation

In the rescoring algorithms described later, a model of speech production is assumed. Given a phone segment sequence and the timings of those segments, this model of speech generation produces acoustic patterns (in the form of MFCCs) of the sort that also occur in real speech (we hope!).

We also hope that our model will describe speech patterns better than conventional models, e.g. HMMs, and that we can eventually exploit this property to improve recognition performance.

Our model of speech production is shown in Fig 1.

Fig 1. The Speech Generation Model.

Describe stages.
Targets and segments.
Dynamics.
Non-linear mapping.

To turn this into a stochastic model that takes account of the variability of speech (and the defects of the model), we can introduce random variables in various places. Possibilities include:

The simplest is to treat the whole of the production process as deterministic until the acoustic output is produced, and add Gaussian "noise" to this acoustic pattern to model the variability.
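
As a rough illustration of this simplest option, the sketch below runs a deterministic chain (targets, dynamics, non-linear mapping) and adds Gaussian noise only at the acoustic output. Everything specific in it is an assumption made for the example, not the model of Fig 1: the function name synthesise, one scalar target per phone, a first-order filter for the dynamics, a tanh mapping with fixed weights, and the noise_sd parameter are all hypothetical.

```python
import math
import random


def synthesise(segments, targets, alpha=0.7, weights=(1.0, -0.5),
               noise_sd=0.0, seed=0):
    """Generate acoustic frames from a phone segmentation.

    segments: list of (phone, n_frames) pairs, i.e. labels and timings.
    targets:  dict mapping each phone to a scalar target (assumed form).
    Returns a list of frames, each a list of len(weights) values.
    """
    rng = random.Random(seed)
    hidden = 0.0  # hidden state, started at rest (assumption)
    frames = []
    for phone, n_frames in segments:
        target = targets[phone]
        for _ in range(n_frames):
            # Dynamics (assumed): first-order filter pulls the hidden
            # state towards the current segment's target.
            hidden = alpha * hidden + (1.0 - alpha) * target
            # Non-linear mapping (assumed): tanh of a weighted state.
            clean = [math.tanh(w * hidden) for w in weights]
            # Stochastic part: Gaussian "noise" added at the output only.
            frames.append([c + rng.gauss(0.0, noise_sd) for c in clean])
    return frames


if __name__ == "__main__":
    example = synthesise([("a", 3), ("b", 2)], {"a": 1.0, "b": -1.0},
                         noise_sd=0.1)
    for frame in example:
        print(frame)
```

With noise_sd set to zero the output is the purely deterministic pattern, which is the "synthesised" acoustics that a scoring step would compare against observed data.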


Scoring using the Hidden Dynamic Model of Speech Production

Such a model of speech generation can be used in the rescoring box on the previous page. The inputs to this rescoring box are:

  1. the phone segment sequence and the timings of these segments (Segmentation and Labelling in Fig 2).
  2. the acoustic data (e.g. MFCCs).

Fig 2. Scoring a phone alignment.

Since the variation in the acoustics for a given phone sequence and timing is assumed to be Gaussian, the log likelihood of the data given the phone sequence can be obtained from a distance calculation between the synthesised and observed acoustics.
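
The distance calculation can be sketched as follows, under the extra assumption (not stated above) that the Gaussian noise is independent across frames and coefficients with a single shared variance. In that case the log likelihood is minus half the squared Euclidean distance divided by the variance, plus a constant that depends only on the variance and the number of values. The function name gaussian_log_likelihood and the variance parameter are hypothetical.

```python
import math


def gaussian_log_likelihood(observed, synthesised, variance=1.0):
    """Log likelihood of observed frames given synthesised frames.

    Both arguments are lists of frames (lists of floats) of equal shape.
    Assumes independent Gaussian noise with one shared variance.
    """
    if len(observed) != len(synthesised):
        raise ValueError("frame counts differ")
    sq_dist = 0.0
    n_values = 0
    for obs, syn in zip(observed, synthesised):
        if len(obs) != len(syn):
            raise ValueError("frame dimensions differ")
        for o, s in zip(obs, syn):
            sq_dist += (o - s) ** 2  # squared Euclidean distance
            n_values += 1
    # log N(o; s, variance) summed over every coefficient of every frame.
    return (-0.5 * sq_dist / variance
            - 0.5 * n_values * math.log(2.0 * math.pi * variance))


if __name__ == "__main__":
    obs = [[0.1, -0.2], [0.3, 0.0]]
    close = [[0.1, -0.1], [0.2, 0.0]]
    far = [[1.0, 1.0], [-1.0, 1.0]]
    print(gaussian_log_likelihood(obs, close))
    print(gaussian_log_likelihood(obs, far))
```

When competing phone alignments are scored against the same acoustic data with the same variance, the constant term is identical for all of them, so ranking by log likelihood is equivalent to ranking by smallest squared distance.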