More complex models cannot be used efficiently in decoding directly
Rescore N-best hypotheses from a conventional recogniser
Add the correct transcription to the N-best list
Assess performance by how often the correct transcription is chosen
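
The evaluation procedure above can be sketched in a few lines. This is only an illustrative sketch, not the rescorer described here: the function name `rescoring_accuracy`, the `(acoustics, nbest, reference)` tuple layout, and the `score` callable are all assumptions made for the example, and higher scores are assumed to mean better hypotheses.

```python
from typing import Any, Callable, Sequence


def rescoring_accuracy(
    utterances: Sequence[tuple[Any, list[str], str]],
    score: Callable[[Any, str], float],
) -> float:
    """Return the fraction of utterances whose correct transcription wins.

    Each utterance is (acoustics, nbest, reference). `score` is a
    hypothetical rescoring function; higher means a better match.
    """
    if not utterances:
        raise ValueError("need at least one utterance")

    chosen_correct = 0
    for acoustics, nbest, reference in utterances:
        # Add the correct transcription if the recogniser did not propose it.
        candidates = list(nbest)
        if reference not in candidates:
            candidates.append(reference)

        # The rescorer picks the highest-scoring candidate.
        best = max(candidates, key=lambda hyp: score(acoustics, hyp))
        if best == reference:
            chosen_correct += 1

    return chosen_correct / len(utterances)
```

Adding the reference to the candidate list means the measure reflects the rescorer's ability to prefer the correct transcription, independent of whether the conventional recogniser happened to include it in its N-best output.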

The challenge is to find a rescoring algorithm that performs better than the conventional recogniser.
Fig. 2 shows the inputs required for training the rescorer.
A phone sequence derived from the correct transcription is aligned
to the acoustic data using some method, preferably automatic.
The aligner here could be a conventional HMM-based recogniser in forced-recognition
mode.

The next two pages describe the contents of the rescoring box, how its parameters are trained, and how it is used to rescore the acoustic match given the acoustic data and the aligned phone sequences.