Normalization of Non-Standard Words

Real text contains a variety of “non-standard” token types, such as digit sequences; words, acronyms and letter sequences in all capitals; mixed-case words (WinNT, SunOS); abbreviations; Roman numerals; URLs and e-mail addresses. Many of these elements are pronounced according to principles quite different from those governing ordinary words. Furthermore, many items have more than one plausible pronunciation, and the correct one must be chosen from context: IV could be “four”, “fourth”, “the fourth”, or “I.V.”
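The token types listed above can be illustrated with a rough pattern-based tagger. This is only a minimal sketch: the category names, the regular expressions, and the function name classify are illustrative assumptions, not the project's taxonomy or tools. It returns every category a token matches, which makes the ambiguity of an item like IV visible.

```python
import re

# Hypothetical coarse categories (assumed for illustration only).
# A token may match several patterns; all matches are reported.
PATTERNS = [
    ("EMAIL", re.compile(r"^[\w.+-]+@[\w-]+(\.[\w-]+)+$")),
    ("URL", re.compile(r"^(https?://|www\.)\S+$", re.IGNORECASE)),
    ("DIGITS", re.compile(r"^\d+$")),
    # Well-formed Roman numerals up to 3999; the lookahead rejects the empty string.
    ("ROMAN", re.compile(
        r"^(?=[IVXLCDM])M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$")),
    ("ALLCAPS", re.compile(r"^[A-Z]{2,}$")),
    # Mixed case: a lowercase letter immediately followed by an uppercase one.
    ("MIXED", re.compile(r"^[A-Za-z]*[a-z][A-Z][A-Za-z]*$")),
]


def classify(token):
    """Return all candidate categories for a token, or ["ORDINARY"] if none match."""
    labels = [name for name, pattern in PATTERNS if pattern.match(token)]
    return labels or ["ORDINARY"]


if __name__ == "__main__":
    for tok in ["IV", "WinNT", "SunOS", "1999", "www.example.com", "text"]:
        print(tok, classify(tok))
```

Surface patterns alone leave IV tagged as both a Roman numeral and an all-capitals letter sequence, and even within one category they do not choose between “four” and “the fourth”. That choice depends on the surrounding context.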

Normalizing such text, that is, rewriting it using ordinary words, is an important issue for several applications. For instance, an essential feature of natural human-computer interfaces is that the computer be capable of responding with spoken replies or comments. A Text-to-Speech module synthesizes the spoken response from text input and must be able to render non-standard items appropriately into speech. In Automatic Speech Recognition, non-standard token types cause problems for training acoustic as well as language models. More sophisticated text normalization will be an important tool for utilizing the vast amounts of on-line text resources. Normalized text is also likely to benefit information extraction applications.

This project will apply language modeling techniques to the creation of wide-coverage models for disambiguating non-standard words in English. Its aim is to create (1) a publicly available corpus of tagged examples, plus a publicly available taxonomy of cases to be considered, and (2) a set of tools that represent the state of the art in text normalization for English.

Final Report


Team Members
Senior Members
Alan Black, University of Edinburgh, CSTR
Stanley Chen, CMU
Mari Ostendorf, Boston University
Richard Sproat, AT&T Labs
Graduate Students
Shankar Kumar, CLSP
Undergraduate Students
Christopher Richards, Williams

Center for Language and Speech Processing