Building Language Models (WS99 Text Normalization)
Stanley F. Chen
Contents
1 Overview
2 Vocabulary Format
3 Corpus Format
4 Building an N-Gram Model
4.1 CountNGram
4.2 BuildNGram.x
5 Building an Interpolated Language Model
6 Evaluating a Language Model
1 Overview
This is documentation for building language models for use
in the text normalization project. This document
only covers a subset of the features of the language modeling
tools, but will be expanded on demand. Currently, this
document explains how to build n-gram language models
and linear interpolations of n-gram models.
To build a language model, you need to have a vocabulary,
or list of unique words to include in the language model,
and a corpus of training data. The following sections
describe the format of each of these files, the
tools for building language models, and the tools for
evaluating their perplexity.
All of the tools are located in the directory
To try the examples listed in the text, place this
directory in your path and copy the files in
    /home/ws99/sfc/pub/data/j015
into your current directory.
To view a list of command-line arguments and flags
for any program, run the program with no arguments.
For any argument corresponding to an input file, a compressed or
uncompressed file can be specified. To have an output file
be compressed, specify a filename with the .gz suffix.
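The suffix convention above can be mirrored in scripts that prepare or inspect these files. The following is a minimal Python sketch, not part of the tools; the helper name open_text is hypothetical, and UTF-8 is assumed only because it is a superset of the ASCII the corpus format calls for.

```python
import gzip


def open_text(path, mode="r"):
    """Open a plain or gzip-compressed text file, chosen by the .gz suffix."""
    if mode not in ("r", "w"):
        raise ValueError("mode must be 'r' or 'w'")
    if str(path).endswith(".gz"):
        # gzip.open needs an explicit text mode ("rt"/"wt") to return str.
        return gzip.open(path, mode + "t", encoding="utf-8")
    return open(path, mode, encoding="utf-8")
```

With this helper, open_text("j015d.train.txt") and open_text("j015d.train.txt.gz") are read in the same way.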
2 Vocabulary Format
A vocabulary file must contain a list of unique words,
one to a line, with no extra spaces and no blank lines.
A vocabulary should contain the most frequent words in
the domain of concern, and can contain at most 65,535 words.
A vocabulary must contain the following words:
- <s>, the beginning-of-sentence token
- </s>, the end-of-sentence token
- <UNK>, the unknown word
All words found in a corpus that are not in the vocabulary
are mapped to the unknown word.
An example vocabulary can be found at
    /home/ws99/sfc/pub/data/j015/j015d.voc
(This vocabulary is sorted roughly by word frequency.)
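The vocabulary rules in this section can be checked before running any of the tools. The sketch below is an illustration in Python, not one of the tools; check_vocab is a hypothetical name, and it assumes the three required tokens count toward the 65,535-word limit, which the text does not state.

```python
REQUIRED = ("<s>", "</s>", "<UNK>")
MAX_WORDS = 65535


def check_vocab(lines):
    """Return a list of problems found in vocabulary lines (empty if none)."""
    problems = []
    seen = set()
    for num, raw in enumerate(lines, start=1):
        word = raw.rstrip("\n")
        if word == "":
            problems.append(f"line {num}: blank line")
            continue
        # Exactly one token per line, with no surrounding white space.
        if word != word.strip() or len(word.split()) != 1:
            problems.append(f"line {num}: extra white space")
            continue
        if word in seen:
            problems.append(f"line {num}: duplicate word {word!r}")
        seen.add(word)
    if len(seen) > MAX_WORDS:
        problems.append(f"{len(seen)} words exceeds the limit of {MAX_WORDS}")
    for token in REQUIRED:
        if token not in seen:
            problems.append(f"missing required token {token}")
    return problems
```

Passing the lines of a vocabulary file to check_vocab returns an empty list when the file satisfies all of the rules above.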
3 Corpus Format
A corpus contains the training text used to construct
a language model. The basic format for a corpus file is ASCII text
where all words/tokens are separated by white space. The token
</s> signals the end of a sentence. Newlines have no
special meaning. Tokens (other than the end-of-sentence token)
surrounded by angle brackets and beginning
with either a lower-case character or the character `/' followed
by a lower-case character are interpreted to be markup tokens
and are ignored. Files in this format must have the suffix
.txt (or .txt.gz if compressed).
Another corpus format supported is an executable, such as a
shell script, that outputs text in the above format to standard
output. Corpora in this format must have the suffix .x, .sh,
or .csh.
An example corpus can be found at
    /home/ws99/sfc/pub/data/j015/j015d.train.txt
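The corpus conventions above (white-space tokenization, </s> as sentence end, ignored markup tokens, and mapping of unknown words) can be sketched in Python as follows. This is one reading of the rules, not the tools' actual reader: the regular expression is an interpretation of the markup rule, the function name is hypothetical, and emitting trailing text that lacks a final </s> as a sentence is an assumption.

```python
import re

# Interpretation of the markup rule: <x...> or </x...> with x lower-case.
MARKUP = re.compile(r"^</?[a-z].*>$")
EOS = "</s>"
UNK = "<UNK>"


def sentences(text, vocab):
    """Yield lists of tokens, one list per sentence, with OOV words mapped."""
    current = []
    for token in text.split():      # newlines are ordinary white space
        if token == EOS:
            yield current
            current = []
        elif MARKUP.match(token):
            continue                # markup tokens are ignored
        else:
            current.append(token if token in vocab else UNK)
    if current:                     # assumed: keep text lacking a final </s>
        yield current
```

Note that under this reading a literal <s> in the text is itself a markup token and is skipped, while <UNK> is not, since it begins with an upper-case character.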
4 Building an N-Gram Model
Building an n-gram model requires two steps:
- CountNGram takes a corpus and vocabulary as input and
outputs the counts of all n-grams in a binary format.
- BuildNGram.x takes the count files produced by
CountNGram and constructs the actual language
model (with suffix .dmp).
These steps are separated because the same count files can
be used to create multiple language models (using different
n-gram order or smoothing method).
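The two-step split can be driven from a script that counts once and builds several models from the same counts. The Python sketch below only assembles argument lists, using the flags documented in the following subsections; the function names are hypothetical, and nothing is executed unless the lists are passed to something like subprocess.run.

```python
import shlex


def count_cmd(corpus, vocab, out_base, n=3, mem_mb=100):
    """Argument list for CountNGram, using only the -n and -mem flags."""
    return ["CountNGram", "-n", str(n), "-mem", str(mem_mb),
            corpus, vocab, out_base]


def build_cmd(count_base, out_base, n=3, alg="kneser-ney-mod-fix",
              heldout=None, cutoffs=None):
    """Argument list for BuildNGram.x, using only the flags described here."""
    cmd = ["BuildNGram.x", "-n", str(n), "-alg", alg]
    if heldout is not None:
        cmd += ["-heldout", heldout]
    if cutoffs is not None:
        cmd += ["-cutoffs", ",".join(str(c) for c in cutoffs)]
    return cmd + [count_base, out_base]


if __name__ == "__main__":
    # Count once, then build two models from the same count files.
    steps = [
        count_cmd("j015d.train.txt", "j015d.voc", "j015e"),
        build_cmd("j015e", "j015e", alg="kneser-ney-mod",
                  heldout="j015d.heldout.txt", cutoffs=(0, 0, 1)),
        build_cmd("j015e", "j015e2", cutoffs=(0, 0, 1)),
    ]
    for cmd in steps:
        print(shlex.join(cmd))
```

Printing the commands first makes it easy to review them before running; replacing the print call with subprocess.run(cmd, check=True) would execute each step in order.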
4.1 CountNGram
The first step in building a language model is counting the
n-grams in your training text using CountNGram. The basic
usage of this program is:
    CountNGram [<flags>] <corpus> <vocab> <out-base>
The <corpus> argument is the name of the training corpus.
The <vocab> argument is the name of the vocabulary file.
NOTE: for best results, use a full pathname when specifying
the vocabulary.
The <out-base> argument specifies the base name of
all count files to be created;
i.e., all files created will begin with this path/prefix. This
argument determines which directory count files will be created in,
and this directory must have adequate free space.
The most important command line flags are:
- -n <val>
- This specifies the longest n-grams to count,
e.g., -n 3 is appropriate for a trigram model (and is the default).
A set of count files can be used to build n-gram models of that
order or below.
- -mem <MB>
- This specifies how many MB of memory the
program will use; the default is 100 MB. The larger this value,
the faster the program will run. Use a value comfortably
lower than the total amount of memory in the machine.
Run CountNGram with no arguments to view a listing of all flags.
To give an example, the command
    CountNGram j015d.train.txt j015d.voc j015e
will create the count files
    j015e.count.1, j015e.count.2, j015e.count.3, j015e.count.check
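To make concrete what counting all n-grams up to order 3 means, here is a small in-memory Python sketch. It is only an illustration: it says nothing about the binary count format, and the boundary handling (one <s> before and one </s> after each sentence) is an assumption rather than documented behavior of CountNGram.

```python
from collections import Counter


def count_ngrams(sentences, max_order=3):
    """Return {order: Counter of n-gram tuples} for orders 1..max_order."""
    counts = {k: Counter() for k in range(1, max_order + 1)}
    for words in sentences:
        # Assumed boundary handling: one <s> at the start, </s> at the end.
        tokens = ["<s>"] + list(words) + ["</s>"]
        for k in range(1, max_order + 1):
            for i in range(len(tokens) - k + 1):
                counts[k][tuple(tokens[i:i + k])] += 1
    return counts
```

The result holds one table per order, loosely analogous to the separate .count.1, .count.2, and .count.3 files, which is why counts collected for trigrams also suffice for bigram and unigram models.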
4.2 BuildNGram.x
To build an n-gram language model from the count files
produced by CountNGram, use the BuildNGram.x shell script.
The basic usage of this program is:
    BuildNGram.x [<flags>] <count-base> <out-base>
The <count-base> argument is the base name of the input count files,
and should be the same as the last argument used in the
corresponding run of CountNGram. The <out-base> argument
specifies the base name of all files to be created;
i.e., all files created will begin with this path/prefix. This
argument determines which directory the language model file
will be created in, and this directory must have adequate free space.
Most of the created files are intermediate and are deleted;
the output language model <out-base>.dmp is kept.
To view the command line flags of BuildNGram.x, run
Smooth with no arguments. The flags of these two programs
are identical except that the -arpabo flag is treated differently.
The most important command line flags are:
- -n <val>
- This specifies the order of the n-gram model to
construct; e.g., -n 3 is appropriate for a trigram model
(and is the default).
- -alg <smooth-alg>
- This specifies the type of
smoothing to use. The suggested choices are kneser-ney-mod-fix
if you don't have appropriate held-out data to optimize smoothing parameters
with, and kneser-ney-mod if you do. To specify a held-out
set, use the -heldout flag.
- -heldout <corpus>
- This specifies the held-out corpus
to use to optimize the parameters of the chosen smoothing technique,
if applicable.
- -cutoffs <count-cutoffs>
- This specifies the count
cutoffs to use, one per n-gram order. For example, for a trigram model the
flag -cutoffs 0,1,2 would specify using all unigrams,
ignoring bigrams with a count of one or less, and ignoring trigrams
with a count of two or less. Better performance is achieved
with lower cutoffs, so this flag should be omitted unless
language model size is an issue.
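The effect of the -cutoffs flag on a set of counts can be sketched as follows. This Python fragment is illustrative only: apply_cutoffs is a hypothetical name, the counts are held in plain dictionaries keyed by n-gram order, and how the tools treat the probability mass of the dropped n-grams is not covered.

```python
def apply_cutoffs(counts, cutoffs):
    """Drop n-grams of order k whose count is <= cutoffs[k - 1]."""
    kept = {}
    for order, table in counts.items():
        threshold = cutoffs[order - 1]
        kept[order] = {ng: c for ng, c in table.items() if c > threshold}
    return kept
```

With cutoffs (0, 1, 2), every unigram with a positive count survives, bigrams need a count of at least two, and trigrams need a count of at least three.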
To give an example, the command
    BuildNGram.x -alg kneser-ney-mod -heldout j015d.heldout.txt -cutoffs 0,0,1 j015e j015e
will create the language model file j015e.dmp. It will be
a trigram model smoothed with modified Kneser-Ney smoothing that excludes
trigrams with only one count. The smoothing parameters will be
chosen to optimize the likelihood of the held-out data j015d.heldout.txt.
If no held-out data is available, the command
    BuildNGram.x -alg kneser-ney-mod-fix -cutoffs 0,0,1 j015e j015e2
would be appropriate. This will create the language model
j015e2.dmp with modified Kneser-Ney smoothing, where
smoothing parameters are calculated using a formula based
on counts in the training data.
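This document does not give the formula used by kneser-ney-mod-fix. As an illustration of what a count-based estimate looks like, the Python sketch below computes the three discounts of modified Kneser-Ney smoothing from counts-of-counts, following the estimates published by Chen and Goodman; that the tool uses exactly these estimates is an assumption, and the function name is hypothetical.

```python
from collections import Counter


def kn_mod_discounts(ngram_counts):
    """Estimate D1, D2, D3+ for one n-gram order from counts-of-counts."""
    # n[r] is the number of distinct n-grams that occur exactly r times.
    n = Counter(ngram_counts.values())
    n1, n2, n3, n4 = n[1], n[2], n[3], n[4]
    if 0 in (n1, n2, n3):
        raise ValueError("need n-grams with counts 1, 2 and 3")
    y = n1 / (n1 + 2 * n2)
    return (1 - 2 * y * n2 / n1,    # discount for n-grams seen once
            2 - 3 * y * n3 / n2,    # discount for n-grams seen twice
            3 - 4 * y * n4 / n3)    # discount for counts of three or more
```

In this scheme the estimates are computed separately for each n-gram order, so a trigram model would call the function once each on its unigram, bigram, and trigram counts.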
5 Building an Interpolated Language Model
To create a language model that is a linear interpolation of
existing language models, a simple text file must be created.
The format of the file is
lmtype = "interp";
lm.0 = "<LM-file1>";
lm.1 = "<LM-file2>";
pr.0 = <wgt1>;
pr.1 = <wgt2>;
The spacing and ordering of these lines are not important.
The file name must end with the suffix .lm. The
language models being combined must have been built using
the same vocabulary.
For example, the file
    /home/ws99/sfc/pub/data/j015/j015e3.lm
which contains the text
lmtype = "interp";
lm.0 = "j015e.dmp";
lm.1 = "j015e2.dmp";
pr.0 = 0.7;
pr.1 = 0.3;
is a language model that weights the first language model by 0.7 and
the second by 0.3.
More than two language models can be combined by adding
the corresponding lm.2 and pr.2 fields and so on.
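The file format and the arithmetic it stands for can be sketched in Python. Both function names are hypothetical. The second function shows linear interpolation of the probabilities that the component models assign to the same word in the same context; it assumes the weights sum to 1, as 0.7 and 0.3 do in the example, which this document does not state as a requirement.

```python
def interp_config(models, weights):
    """Return the text of an interpolation file for the given models."""
    if len(models) != len(weights):
        raise ValueError("need one weight per model")
    lines = ['lmtype = "interp";']
    lines += [f'lm.{i} = "{m}";' for i, m in enumerate(models)]
    lines += [f"pr.{i} = {w};" for i, w in enumerate(weights)]
    return "\n".join(lines) + "\n"


def interpolate(probs, weights):
    """Linear interpolation: the weighted sum of component probabilities."""
    return sum(w * p for w, p in zip(weights, probs))
```

For instance, if j015e.dmp assigns a word probability 0.2 and j015e2.dmp assigns it 0.1, the interpolated probability under the weights above is 0.7 * 0.2 + 0.3 * 0.1 = 0.17.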
6 Evaluating a Language Model
To evaluate the perplexity of a language model on some text, use
the command
    EvalLM [<flags>] <corpus> <LM-file>
The <corpus> argument is the name of the file containing
the evaluation corpus. The <LM-file>
argument is the name of the file containing the language
model to be evaluated.
The most important command line flag is:
- -n <val>
- This specifies the order of the n-gram model to
evaluate; e.g., -n 3 is appropriate for a trigram model
(and is the default). This value can be lower than the
actual order of the language model file specified; in this case
only the lower-order subset of the language model is used.
NOTE: setting this value lower than the actual order of
an n-gram model can yield very poor performance depending
on the smoothing method used to create the model.
Run EvalLM with no arguments to view a listing of all flags.
To give an example, the command
    EvalLM j015d.heldout.txt j015e.dmp
will output the perplexity of the trigram model j015e.dmp on
the held-out set j015d.heldout.txt.
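For reference, perplexity is the exponential of the average negative log-probability per token. The Python sketch below computes it from a list of per-token probabilities. It uses the standard definition only; which tokens EvalLM includes in the average (for example, whether </s> is counted) is not specified in this document, and the function name is hypothetical.

```python
import math


def perplexity(probs):
    """Perplexity of a sequence given each token's model probability."""
    if not probs:
        raise ValueError("no tokens to score")
    log_sum = sum(math.log(p) for p in probs)
    return math.exp(-log_sum / len(probs))
```

A model that assigns probability 0.25 to every token has a perplexity of 4, the same as a uniform choice among four words; lower values indicate a better fit to the evaluation text.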
File translated from TEX by TTH, version 2.20.
On 19 Jul 1999, 16:10.