Building Language Models (WS99 Text Normalization)

Stanley F. Chen

Contents

1  Overview
2  Vocabulary Format
3  Corpus Format
4  Building an N-Gram Model
    4.1  CountNGram
    4.2  BuildNGram.x
5  Building an Interpolated Language Model
6  Evaluating a Language Model

1  Overview

This is documentation for building language models for use in the text normalization project. This document only covers a subset of the features of the language modeling tools, but will be expanded on demand. Currently, this document explains how to build n-gram language models, and linear interpolations of n-gram models.

To build a language model, you need to have a vocabulary, or list of unique words to include in the language model, and a corpus of training data. The following sections describe the format of each of these files, the tools for building language models, and the tools for evaluating their perplexity.

All of the tools are located in the directory

/home/ws99/sfc/pub/exec
To try the examples listed in the text, place this directory in your path and copy the files in
/home/ws99/sfc/pub/data/j015
into your current directory.

To view a list of command-line arguments and flags for any program, run the program with no arguments. For any argument corresponding to an input file, a compressed or uncompressed file can be specified. To have an output file be compressed, specify a filename with the .gz suffix.

2  Vocabulary Format

A vocabulary file must contain a list of unique words, one per line, with no extra spaces and no blank lines. A vocabulary should contain the most frequent words in the domain of concern, and can contain at most 65,535 words. A vocabulary must contain the following words:

All words not in the vocabulary found in a corpus are mapped to the unknown word.

An example vocabulary can be found at

/home/ws99/sfc/pub/data/j015/j015d.voc
(This vocabulary is sorted roughly by word frequency.)
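
The format rules above can be checked mechanically before running the tools. The following is a minimal Python sketch, not part of the toolkit: the function name check_vocabulary is hypothetical, and it checks only the rules stated in this section (unique words, one per line, no extra spaces, no blank lines, at most 65,535 words). It does not check for the required words.

```python
MAX_VOCAB_SIZE = 65535  # limit stated in this section


def check_vocabulary(lines):
    """Return a list of format problems found in the vocabulary lines.

    `lines` is any iterable of text lines, e.g. an open file.
    An empty list means no problems were found.
    """
    problems = []
    seen = set()
    for number, raw in enumerate(lines, start=1):
        word = raw.rstrip("\n")
        if word == "":
            problems.append(f"line {number}: blank line")
            continue
        # Leading, trailing, or internal white space breaks the
        # one-word-per-line rule.
        if word != word.strip() or len(word.split()) != 1:
            problems.append(f"line {number}: extra white space")
            continue
        if word in seen:
            problems.append(f"line {number}: duplicate word {word!r}")
        seen.add(word)
    if len(seen) > MAX_VOCAB_SIZE:
        problems.append(
            f"{len(seen)} words exceeds the limit of {MAX_VOCAB_SIZE}"
        )
    return problems
```

For an uncompressed vocabulary, a call such as check_vocabulary(open("j015d.voc")) would list any violations. Compressed files would need to be opened with a decompressing reader first.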

3  Corpus Format

A corpus contains the training text used to construct a language model. The basic format for a corpus file is ASCII text where all words/tokens are separated by white space. The token </s> signals the end of a sentence. Newlines have no special meaning. Tokens (other than the end-of-sentence token) that are surrounded by angle brackets and begin with either a lower-case character or the character / followed by a lower-case character are interpreted as markup tokens and are ignored. Files in this format must have the suffix .txt (or .txt.gz if compressed).

Another corpus format supported is an executable, such as a shell script, that outputs text in the above format to standard output. Corpora in this format must have the suffix .x, .sh, or .csh.

An example corpus can be found at

/home/ws99/sfc/pub/data/j015/j015d.train.txt
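
To make the tokenization rules concrete, here is a small Python sketch of how text in the basic corpus format could be split into sentences. It is an illustration of the rules described above, not the tools' own reader. The function name read_sentences is hypothetical, and the handling of trailing words that are not followed by </s> (kept as a final sentence) is an assumption.

```python
import re

END_OF_SENTENCE = "</s>"

# A markup token is enclosed in angle brackets and begins with a
# lower-case letter, or with '/' followed by a lower-case letter.
MARKUP = re.compile(r"</?[a-z].*>")


def read_sentences(text):
    """Split corpus text into sentences, each a list of word tokens."""
    sentences = []
    current = []
    # split() with no argument treats newlines like any other white space.
    for token in text.split():
        if token == END_OF_SENTENCE:
            sentences.append(current)
            current = []
        elif MARKUP.fullmatch(token):
            continue  # markup tokens are ignored
        else:
            current.append(token)
    if current:
        # Assumption: words after the last </s> form a final sentence.
        sentences.append(current)
    return sentences
```

Note that </s> itself fits the markup pattern, which is why it is tested first.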

4  Building an N-Gram Model

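Building an n-gram model requires two steps: counting the n-grams in the training text with CountNGram, and building the language model from the resulting count files with BuildNGram.x.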

These steps are separated because the same count files can be used to create multiple language models (using different n-gram order or smoothing method).

4.1  CountNGram

The first step in building a language model is counting the n-grams in your training text using CountNGram. The basic usage of this program is:

CountNGram [<flags>] <corpus> <vocab> <out-base>
The <corpus> argument is the name of the training corpus. The <vocab> argument is the name of the vocabulary file. NOTE: for best results, use a full pathname when specifying the vocabulary. The <out-base> argument specifies the base name of all count files to be created; i.e., all files created will begin with this path/prefix. This argument determines which directory count files will be created in, and this directory must have adequate free space.

The most important command line flags are:

-n <val>
This specifies the longest n-grams to count, e.g., -n 3 is appropriate for a trigram model (and is the default). A set of count files can be used to build n-gram models of that order or below.
-mem <MB>
This specifies how many MB of memory the program will use; the default is 100 MB. The larger this value, the faster the program will run. Use a value comfortably lower than the total amount of memory in the machine.

Run CountNGram with no arguments to view a listing of all flags.

To give an example, the command

CountNGram j015d.train.txt j015d.voc j015e
will create the count files
j015e.count.1, j015e.count.2, j015e.count.3, j015e.count.check
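
As a conceptual illustration of what counting n-grams up to a given order means, the following Python sketch counts all n-grams of orders 1 through max_order in a list of sentences. It says nothing about the format of the count files or about how CountNGram works internally. The function name count_ngrams is hypothetical, and two details are assumptions: n-grams do not cross sentence boundaries, and the end-of-sentence token is counted as the last token of each sentence with no padding at the start.

```python
from collections import Counter


def count_ngrams(sentences, max_order=3):
    """Count n-grams of every order from 1 to max_order.

    `sentences` is a list of token lists. Returns a dict mapping each
    order to a Counter keyed by n-gram tuples.
    """
    counts = {order: Counter() for order in range(1, max_order + 1)}
    for sentence in sentences:
        # Assumption: </s> closes each sentence and is counted.
        tokens = list(sentence) + ["</s>"]
        for order in range(1, max_order + 1):
            for start in range(len(tokens) - order + 1):
                ngram = tuple(tokens[start:start + order])
                counts[order][ngram] += 1
    return counts
```

Because the result holds counts for every order up to max_order, it mirrors the point above that one set of counts can serve models of that order or below.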

4.2  BuildNGram.x

To build an n-gram language model from the count files produced by CountNGram, use the BuildNGram.x shell script. The basic usage of this program is:

BuildNGram.x [<flags>] <count-base> <out-base>
The <count-base> argument is the base name of the input count files, and should be the same as the last argument used in the corresponding run of CountNGram. The <out-base> argument specifies the base name of all files to be created; i.e., all files created will begin with this path/prefix. This argument determines which directory the language model file will be created in, and this directory must have adequate free space. Most of these created files will be deleted, except for the output language model <out-base>.dmp.

To view the command line flags of BuildNGram.x, run Smooth with no arguments. The flags of these two programs are identical except that the -arpabo flag is treated differently. The most important command line flags are:

-n <val>
This specifies the order of the n-gram model to construct; e.g., -n 3 is appropriate for a trigram model (and is the default).
-alg <smooth-alg>
This specifies the type of smoothing to use. The suggested choices are kneser-ney-mod-fix if you don't have appropriate held-out data to optimize smoothing parameters with, and kneser-ney-mod if you do. To specify a held-out set, use the -heldout flag.
-heldout <corpus>
This specifies the held-out corpus to use to optimize the parameters of the chosen smoothing technique, if applicable.
-cutoffs <count-cutoffs>
This specifies the count cutoffs to use. For example, for a trigram model the flag -cutoffs 0,1,2 would specify using all unigrams, ignoring bigrams with a count of one or less, and ignoring trigrams with a count of two or less. Better performance is achieved with lower cutoffs, so this flag should be omitted unless language model size is an issue.
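
The effect of a cutoff list can be sketched in a few lines of Python. This is only an illustration of the rule stated for -cutoffs (an n-gram of order k is ignored when its count is at or below the k-th cutoff). The function names parse_cutoffs and apply_cutoffs are hypothetical, and how the smoothing algorithms treat the ignored n-grams is not modeled here.

```python
def parse_cutoffs(text):
    """Turn a string such as '0,1,2' into a list of integer cutoffs."""
    return [int(part) for part in text.split(",")]


def apply_cutoffs(counts, cutoffs):
    """Keep only n-grams whose count exceeds the cutoff for their order.

    `counts` maps order -> {ngram: count}; `cutoffs` lists one cutoff
    per order, lowest order first.
    """
    kept = {}
    for order, cutoff in enumerate(cutoffs, start=1):
        kept[order] = {
            ngram: count
            for ngram, count in counts.get(order, {}).items()
            if count > cutoff
        }
    return kept
```

With cutoffs 0,1,2, a bigram seen once is dropped and a trigram seen twice is dropped, while every unigram is kept.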

To give an example, the command

BuildNGram.x -alg kneser-ney-mod -heldout j015d.heldout.txt -cutoffs 0,0,1 j015e j015e
will create the language model file j015e.dmp. It will be a trigram model smoothed with modified Kneser-Ney smoothing that excludes trigrams with only one count. The smoothing parameters will be chosen to optimize the likelihood of the held-out data j015d.heldout.txt.

If no held-out data is available, the command

BuildNGram.x -alg kneser-ney-mod-fix -cutoffs 0,0,1 j015e j015e2
would be appropriate. This will create the language model j015e2.dmp with modified Kneser-Ney smoothing, where smoothing parameters are calculated using a formula based on counts in the training data.
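
This document does not state which formula kneser-ney-mod-fix uses. For orientation only, one well-known count-based estimate of the three modified Kneser-Ney discounts from the smoothing literature is Y = n1/(n1 + 2 n2), D1 = 1 - 2Y n2/n1, D2 = 2 - 3Y n3/n2, D3+ = 3 - 4Y n4/n3, where nk is the number of n-grams of a given order seen exactly k times. The Python sketch below computes these values. Treat it as an assumption about the kind of formula involved, not as a description of the tool. The function name modified_kn_discounts is hypothetical.

```python
from collections import Counter


def modified_kn_discounts(ngram_counts):
    """Estimate discounts (D1, D2, D3+) from the counts of one n-gram order.

    `ngram_counts` is an iterable of the counts of individual n-grams.
    """
    # count_of_counts[k] = number of n-grams that occur exactly k times
    count_of_counts = Counter(ngram_counts)
    n1, n2, n3, n4 = (count_of_counts[k] for k in (1, 2, 3, 4))
    if min(n1, n2, n3) == 0:
        raise ValueError("need n-grams with counts 1, 2, and 3")
    y = n1 / (n1 + 2 * n2)
    d1 = 1 - 2 * y * n2 / n1
    d2 = 2 - 3 * y * n3 / n2
    d3_plus = 3 - 4 * y * n4 / n3
    return d1, d2, d3_plus
```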

5  Building an Interpolated Language Model

To create a language model that is a linear interpolation of existing language models, a simple text file must be created. The format of the file is

lmtype = "interp";
lm.0 = "<LM-file1>";
lm.1 = "<LM-file2>";
pr.0 = <wgt1>;
pr.1 = <wgt2>;
The spacing and ordering of these lines are not important. The file name must end with the suffix .lm. The language models being combined must have been built using the same vocabulary.

For example, the file

/home/ws99/sfc/pub/data/j015/j015e3.lm
which contains the text
lmtype = "interp";
lm.0 = "j015e.dmp";
lm.1 = "j015e2.dmp";
pr.0 = 0.7;
pr.1 = 0.3;
is a language model that weights the first language model by 0.7 and the second by 0.3.

More than two language models can be combined by adding the corresponding lm.2 and pr.2 fields and so on.
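
The following Python sketch shows one way to read a description file in the format above and what the weights mean: the interpolated probability of a word is the weighted sum of the component models' probabilities for that word. It is not the tools' parser. The names parse_interp_file and interpolate are hypothetical, and the component probabilities are supplied by the caller because the .dmp format is not described here. For the result to be a proper probability distribution the weights should sum to 1. Whether the tools enforce or renormalize this is not stated in this document.

```python
import re

# One statement: key = "quoted value";  or  key = bare_value;
STATEMENT = re.compile(r'([\w.]+)\s*=\s*(?:"([^"]*)"|([^;"\s]+))\s*;')


def parse_interp_file(text):
    """Return a list of (LM file name, weight) pairs from a .lm description."""
    fields = {}
    for key, quoted, bare in STATEMENT.findall(text):
        fields[key] = quoted if bare == "" else bare
    if fields.get("lmtype") != "interp":
        raise ValueError("not an interpolated model description")
    models = []
    index = 0
    while f"lm.{index}" in fields:
        models.append((fields[f"lm.{index}"], float(fields[f"pr.{index}"])))
        index += 1
    return models


def interpolate(weights, component_probs):
    """Weighted sum of the component models' probabilities for one word."""
    return sum(w * p for w, p in zip(weights, component_probs))
```

For the example file above, if the two component models assign a word probabilities 0.01 and 0.02, the interpolated probability is 0.7 * 0.01 + 0.3 * 0.02 = 0.013.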

6  Evaluating a Language Model

To evaluate the perplexity of a language model on some text, use the command

EvalLM [<flags>] <corpus> <LM-file>
The <corpus> argument is the name of the file containing the evaluation corpus. The <LM-file> argument is the name of the file containing the language model to be evaluated.

The most important command line flag is:

-n <val>
This specifies the order of the n-gram model to evaluate; e.g., -n 3 is appropriate for a trigram model (and is the default). This value can be lower than the actual order of the language model file specified; in this case only the lower-order subset of the language model is used. NOTE: setting this value lower than the actual order of an n-gram model can yield very poor performance depending on the smoothing method used to create the model.

Run EvalLM with no arguments to view a listing of all flags.

To give an example, the command

EvalLM j015d.heldout.txt j015e.dmp
will output the perplexity of the trigram model j015e.dmp on the heldout set j015d.heldout.txt.
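
For readers unfamiliar with the measure, perplexity is the exponential of the average negative log probability the model assigns to each token of the evaluation text. Lower values are better. The Python sketch below computes it from a list of per-token probabilities. The function name perplexity is hypothetical, and exactly which tokens EvalLM includes in the average (for example, end-of-sentence tokens) is not specified in this document.

```python
import math


def perplexity(token_probs):
    """Perplexity of a text given the model probability of each token."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    if any(p <= 0.0 for p in token_probs):
        raise ValueError("every token needs a positive probability")
    # Sum logs rather than multiplying probabilities to avoid underflow.
    total_log_prob = sum(math.log(p) for p in token_probs)
    return math.exp(-total_log_prob / len(token_probs))
```

A model that assigns every token probability 1/4 has perplexity 4, as if it were choosing uniformly among four words at each position.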


File translated from TEX by TTH, version 2.20.
On 19 Jul 1999, 16:10.