Guide to Evaluating Text Normalization

Stanley F. Chen

Evaluating text normalization quality involves editing a simple text file. A sample excerpt of such a file is:

...en boot without *your config.sys     (<F5>       d...
...en boot without  your config dot sys ( < F. five >. d...
1

...ngers >of   the *jack slide, may be too narrow, causi...
...ngers >. of the  jack slide  may be too narrow  causi...
1

...   >>    >> How *did we determine there is a USB     ...
...>. >. >. >. How  did we determine there is a U. S. B....
1

     <<<I  am done *testing now... PC110              st...
     <<<I. am done  testing now    P. C. one one zero st...
1

... to look at the *website as you can compare the Type ...
... to look at the  website as you can compare the Type ...
1

...   off, pop the *battery out and then put it back.>>>
...ro off  pop the  battery out and then put it back >>>
1

Text comes in pairs of lines: the first line is the original raw text and the second is the normalized text. For each pair, evaluate the correctness of a single space-separated token. That token is marked with an asterisk immediately to its left and always starts in the same column. The character below each pair is the judgement for that example; it is initially set to 1, which denotes correct. Your job is to set that character to 1 for correct, 0 for incorrect, or m for misalignment. Misalignment means that no judgement can be made from the context presented, usually because the automatic alignment between the two lines is wrong.
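
For readers who want to check an edited file mechanically, the layout described above can be read with a short script. The following is only a sketch under stated assumptions, not part of the original evaluation tooling: the names `Example`, `parse_examples`, and `tally` are hypothetical, blank lines are assumed to separate examples, the first asterisk on the raw line is assumed to be the marker, and the normalized counterpart is assumed to run from the token's column up to the column where the next raw token starts (which is how the sample excerpt appears to be aligned).

```python
from collections import Counter
from dataclasses import dataclass

# Judgement characters named in the guide: correct, incorrect, misalignment.
VALID_JUDGEMENTS = {"1", "0", "m"}


@dataclass
class Example:
    raw: str         # original raw text, with '*' before the marked token
    normalized: str  # normalized text, column-aligned with the raw line
    judgement: str   # "1", "0", or "m"

    def marked_token(self):
        """Return (raw token, normalized text in the same column span)."""
        # Assumption: the first '*' is the marker (raises ValueError if absent).
        start = self.raw.index("*") + 1
        end = self.raw.find(" ", start)
        if end == -1:
            end = len(self.raw)
        raw_token = self.raw[start:end]
        # Assumption: the normalized span ends where the next raw token begins.
        nxt = end
        while nxt < len(self.raw) and self.raw[nxt] == " ":
            nxt += 1
        stop = nxt if nxt < len(self.raw) else len(self.normalized)
        return raw_token, self.normalized[start:stop].strip()


def parse_examples(text):
    """Group the non-blank lines into (raw, normalized, judgement) triples."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if len(lines) % 3 != 0:
        raise ValueError("expected groups of three non-blank lines")
    examples = []
    for i in range(0, len(lines), 3):
        raw, normalized, judgement = lines[i:i + 3]
        judgement = judgement.strip()
        if judgement not in VALID_JUDGEMENTS:
            raise ValueError(f"bad judgement {judgement!r} in example {i // 3 + 1}")
        examples.append(Example(raw, normalized, judgement))
    return examples


def tally(examples):
    """Count how many examples carry each judgement."""
    return Counter(ex.judgement for ex in examples)
```

Calling `parse_examples` on the contents of an edited file and then `tally` on the result gives a quick count of correct, incorrect, and misaligned examples, and the `ValueError` flags any judgement line that was accidentally set to something other than 1, 0, or m.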

The <<< symbol marks the beginning of a paragraph, >>> marks the end of a paragraph, and ... indicates that the paragraph continues beyond the context shown. The actual text is 132 characters wide, so make your editor window at least that wide before editing the file.
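
The context markers and the width requirement can also be checked in a few lines. This is a hedged sketch: `describe_context` and `too_wide_lines` are hypothetical helper names, the markers are assumed to sit at the very start and end of a line once padding spaces are stripped, and 132 is treated as a maximum line width.

```python
def describe_context(line):
    """Classify the left and right edges of a context line."""
    text = line.strip()  # ignore padding spaces around the markers
    if text.startswith("<<<"):
        left = "paragraph start"
    elif text.startswith("..."):
        left = "continues"
    else:
        left = "unknown"
    if text.endswith(">>>"):
        right = "paragraph end"
    elif text.endswith("..."):
        right = "continues"
    else:
        right = "unknown"
    return left, right


def too_wide_lines(text, width=132):
    """Return 1-based numbers of lines longer than `width` characters."""
    return [n for n, ln in enumerate(text.splitlines(), start=1)
            if len(ln) > width]
```

A raw line whose own text happens to begin or end with one of these character sequences would be misclassified, so the result is a reading aid rather than a definitive parse.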

To decide whether a space-separated token is correct, use the following guidelines:


File translated from TeX by TTH, version 2.20.
On 31 Jul 1999, 17:34.