Lattice Rescoring (WS99 Text Normalization)

Stanley F. Chen

1  Overview

This document describes how to find the highest-scoring hypothesis in a lattice given a language model.

All of the tools are located in the directory

/home/ws99/sfc/pub/exec
To try the examples listed in the text, place this directory in your path and copy the files in
/home/ws99/sfc/pub/data/h034b
into your current directory.

2  Lattice Format

This section describes the format of lattices. Lattices for multiple utterances can be stored in the same file.

At the beginning of the lattice for each utterance, include the line

FSM-ID: <unique-ID>
The ID is arbitrary; it is used by later processing steps. At the end of the lattice for each utterance, put the line
END

To list an alternative for the nth word in a sentence, use the line:

<n-1>   <n>   <raw-word>   [<expanded-word>]   [<cost>]   [<tag>]
The fields must be separated by exactly one tab character. Multiple words in <expanded-word> are separated by spaces. The last three fields are optional. If <expanded-word> is missing or empty, it is assumed to be the same as <raw-word>. To specify an empty <expanded-word>, use the token <sil>. <cost> is a base-10 log probability. <tag> is our internal tag, e.g., ASWD.

Alternatives must be listed for words in the order they occur in the sentence, i.e., alternatives for the (n+1)st word must follow those of the nth word.

The following is a sample file (/home/ws99/sfc/pub/data/h034b/h034b.alt):

FSM-ID: a034c1
0       1       NATO    NATO    -0.1    ASWD
0       1       NATO    N. A. T. O.     -0.4    LSEQ
1       2       LIVES   LIVES   0       ASWD
2       3       ###     <sil>   0       SLNT
3       4       ON
4       5       AND
5       6       ON
END
FSM-ID: a034c2
0       1       NATO    NATO    -0.1    ASWD
0       1       NATO    N. A. T. O.     -0.4    LSEQ
1       2       LIVES   LIVES   0       ASWD
2       3       ###     <sil>   0       SLNT
3       4       ###     <sil>
END
Another sample file can be found in
/home/ws99/sfc/pub/data/h034b/e034e.alt
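
The format described in this section can be read with a short script. The following is a minimal sketch in Python, not part of the tools in this directory; the names Alternative and parse_lattices are hypothetical. Because this section does not say what value a missing <cost> takes, the sketch stores it as None rather than guessing.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Alternative:
    start: int             # <n-1>
    end: int               # <n>
    raw: str               # <raw-word>
    expanded: List[str]    # words of <expanded-word>; empty list for <sil>
    cost: Optional[float]  # base-10 log probability, None if the field is absent
    tag: Optional[str]     # internal tag such as ASWD, None if absent


def parse_lattices(text: str) -> Dict[str, List[Alternative]]:
    """Parse a lattice file into {FSM-ID: [alternatives in file order]}."""
    lattices: Dict[str, List[Alternative]] = {}
    current = None
    for line in text.splitlines():
        if not line.strip():
            continue
        if line.startswith("FSM-ID:"):
            current = line[len("FSM-ID:"):].strip()
            lattices[current] = []
        elif line.strip() == "END":
            current = None
        else:
            if current is None:
                raise ValueError("alternative outside FSM-ID ... END: " + line)
            fields = line.split("\t")  # fields are separated by single tabs
            raw = fields[2]
            # A missing or empty <expanded-word> defaults to <raw-word>.
            exp = fields[3] if len(fields) > 3 and fields[3] else raw
            expanded = [] if exp == "<sil>" else exp.split(" ")
            cost = float(fields[4]) if len(fields) > 4 and fields[4] else None
            tag = fields[5] if len(fields) > 5 and fields[5] else None
            lattices[current].append(
                Alternative(int(fields[0]), int(fields[1]), raw, expanded, cost, tag)
            )
    return lattices
```

Calling parse_lattices(open("h034b.alt").read()) would return one list of alternatives per FSM-ID, in the order they appear in the file.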

3  Calculating the Best Hypothesis

To calculate the best hypothesis in a lattice given a language model, first convert the lattice file into a format that the lattice rescorer can read. Use the command:

e034c.pl -aux <aux-file> <in-lattice> > <out-lattice>
The file <aux-file> is created by this command; it is used later to align the best hypothesis with the tokens in the original raw text. For example, the command
e034c.pl -aux h034b.aux e034e.alt > h034b.stxt
creates the auxiliary file h034b.aux and the converted lattice h034b.stxt.

Then, to calculate the best hypothesis in the converted lattice, use the command

Lattice -lattype fsm-list -lmfile <LM-file> -unkpen 1e-10 -langwgt 1 \
    -outtype nbest -outfile <out-file> -1best.3 <lattice-file>
For example, the command
Lattice -lattype fsm-list -lmfile f034c.dmp -unkpen 1e-10 -langwgt 1 \
    -outtype nbest -outfile h034b.nbest -1best.3 h034b.stxt
creates the file h034b.nbest containing the best hypothesis for each utterance.
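
To illustrate what "best hypothesis" means here, the following Python sketch runs a Viterbi search over one utterance's alternatives. It is not the implementation of the Lattice tool. It assumes that a hypothesis score is the sum of the lattice costs plus the language weight times a bigram log probability, which this document does not state; the function best_hypothesis and the lm_logprob callback are hypothetical stand-ins for the real language model.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# One slot per word position: a list of (expanded words, lattice cost in log10).
Slot = List[Tuple[Sequence[str], float]]


def best_hypothesis(
    slots: List[Slot],
    lm_logprob: Callable[[str, str], float],  # hypothetical: log10 P(word | prev)
    lang_weight: float = 1.0,
) -> Tuple[float, List[str]]:
    """Return (score, words) of the highest-scoring path through the slots."""
    # For a bigram model it is enough to keep the best path per last word.
    beams: Dict[str, Tuple[float, List[str]]] = {"<s>": (0.0, [])}
    for slot in slots:
        new_beams: Dict[str, Tuple[float, List[str]]] = {}
        for prev, (score, words) in beams.items():
            for expanded, cost in slot:
                s = score + cost
                last = prev
                for w in expanded:  # an empty expansion (<sil>) adds no LM score
                    s += lang_weight * lm_logprob(last, w)
                    last = w
                if last not in new_beams or s > new_beams[last][0]:
                    new_beams[last] = (s, words + list(expanded))
        beams = new_beams
    return max(beams.values(), key=lambda b: b[0])
```

With a model that penalizes every word equally, the single token NATO wins over the four-token letter sequence N. A. T. O.; a model that strongly disfavors NATO reverses the choice despite the lower lattice cost of the letter sequence.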

To create a file containing the alignment between the best hypothesis and the tokens in the original raw text, use the command:

f034e.pl <best-hyp-file> | f034f.pl <aux-file> > <out-file>
For example, the command
f034e.pl h034b.nbest | f034f.pl h034b.aux > h034b.align
creates the file h034b.align containing the alignment.
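
The three steps of this section can be chained from a script. The Python sketch below only builds the command lines shown above, using the same flags; the function name pipeline_commands and the convention of deriving all output names from one stem are assumptions made for illustration.

```python
import shlex
from typing import List


def pipeline_commands(alt_file: str, lm_file: str, stem: str) -> List[str]:
    """Build the convert, rescore, and align commands for one lattice file."""
    q = shlex.quote
    aux, stxt, nbest, align = (
        f"{stem}.{ext}" for ext in ("aux", "stxt", "nbest", "align")
    )
    return [
        # Step 1: convert the lattice and write the auxiliary file.
        f"e034c.pl -aux {q(aux)} {q(alt_file)} > {q(stxt)}",
        # Step 2: rescore with the language model.
        f"Lattice -lattype fsm-list -lmfile {q(lm_file)} -unkpen 1e-10 "
        f"-langwgt 1 -outtype nbest -outfile {q(nbest)} -1best.3 {q(stxt)}",
        # Step 3: align the best hypothesis with the original tokens.
        f"f034e.pl {q(nbest)} | f034f.pl {q(aux)} > {q(align)}",
    ]
```

Each string can be passed to subprocess.run(cmd, shell=True, check=True) in order, provided the tools directory is in your path; check=True stops the chain at the first failing step.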


File translated from TEX by TTH, version 2.20.
On 19 Jul 1999, 16:10.