Richard Sproat (Google Research) “Neural Models of Text Normalization for Speech Applications”
3400 N Charles St
Baltimore, MD 21218
Speech applications such as text-to-speech (TTS) or automatic speech recognition (ASR), must not only know how to read ordinary words, but must also know how to read numbers, abbreviations, measure expressions, times, dates, and a whole range of other constructions that one frequently finds in written texts. The problem of dealing with such material is called text normalization. The traditional approach to this problem, and the one currently used in Google’s deployed TTS and ASR systems, involves large hand-constructed grammars, which are costly to develop and tricky to maintain. It would be nice if one could simply train a system from text paired with its verbalization.
I will present our work on applying neural sequence-to-sequence RNN models to the problem of text normalization. Given sufficient training data, such models can achieve very high accuracy, but also tend to produce the occasional error — reading “kB” as “hectare”, misreading a long number such as “3,281” — that would be problematic in a real application. The most powerful method we have found to correct such errors is to use finite-state over-generating covering grammars at decoding time to guide the RNN away from “silly” readings: Such covering grammars can be learned from a very small amount of annotated data. The resulting system is thus a hybrid system, rather than a purely neural one, a purely neural approach being apparently impossible at present.
(Joint work with Ke Wu, Hao Zhang, Kyle Gorman, Felix Stahlberg, Xiaochang Peng and Brian Roark).