Underlying model of speech production and using it to rescore
HBR 14/07/1998
| The Model of Speech Generation |
In the rescoring algorithms described later, a model of speech production is assumed. Given a sequence of phone segments and the timings of those segments, this model of speech generation produces acoustic patterns (in the form of MFCCs) of the sort that also occur in real speech (we hope!).
We also hope that our model will describe speech patterns better than conventional models, e.g. HMMs, and that we can eventually exploit this property to improve recognition performance.
Our model of speech production is shown in Fig 1.

Describe stages.
Targets and segments.
Dynamics.
Non-linear mapping.
To turn this into a stochastic model that takes account of the variability of speech (and the defects of the model), we can introduce random variables in various places. Possibilities include:
| Scoring using the Hidden Dynamic Model of Speech Production |
So, such a model of speech generation can be used in the rescoring box on the previous page. The inputs to this rescoring box are:

Since the variation in the acoustics for a given phone sequence and timing is assumed to be Gaussian, the log likelihood of the data given the phone sequence can be obtained from a distance calculation between the synthesised and observed acoustics.
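
As a rough illustration of that distance calculation, here is a minimal Python sketch. It is not the actual rescoring code. It assumes that the Gaussian has a single variance shared by every MFCC coefficient and every frame, and that the synthesised and observed acoustics are already aligned frame by frame. The function name gaussian_log_likelihood and the variance parameter are made up for this example.

```python
import math


def gaussian_log_likelihood(observed, synthesised, variance=1.0):
    """Log likelihood of observed frames given synthesised frames.

    Assumption (not from the original text): an isotropic Gaussian with
    one shared variance, so the score reduces to a scaled squared
    Euclidean distance plus a normalisation constant.
    Both arguments are lists of frames; each frame is a list of floats.
    """
    if variance <= 0.0:
        raise ValueError("variance must be positive")
    if len(observed) != len(synthesised):
        raise ValueError("frame counts differ")

    squared_distance = 0.0
    n_values = 0
    for obs_frame, syn_frame in zip(observed, synthesised):
        if len(obs_frame) != len(syn_frame):
            raise ValueError("frame dimensions differ")
        for o, s in zip(obs_frame, syn_frame):
            squared_distance += (o - s) ** 2
            n_values += 1

    # log N(o; s, variance) summed over all coefficients and frames
    normaliser = 0.5 * n_values * math.log(2.0 * math.pi * variance)
    return -0.5 * squared_distance / variance - normaliser
```

Under these assumptions, a hypothesis whose synthesised acoustics lie closer to the observed acoustics always gets the higher score, so ranking hypotheses by log likelihood is the same as ranking them by squared distance. With a full or per-coefficient covariance, the squared distance would become a Mahalanobis distance.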