Brian Roark (Google AI): Romanization, non-standard orthography and text entry

In this talk, we present issues in natural language modeling for text entry in languages that use noisy (i.e., non-standard) romanization strategies, with a particular focus on languages using Indic scripts. We discuss romanization strategies, and present data indicating that this sort of romanization typically amounts to a rough phonetic transcription. We present Gboard keyboards that make use of models very similar to widely used grapheme-to-phoneme models. We also discuss language modeling of romanized text directly.
Brian Roark is a computational linguist working on various topics in natural language processing. His research interests include syntactic parsing of text and speech; language modeling for automatic speech recognition and other applications; supervised and unsupervised learning of language and parsing models; and text entry, accessibility, and augmentative and alternative communication (AAC).
Before joining Google as a research scientist in 2013, he was a faculty member for nine years in the Center for Spoken Language Understanding (CSLU) at Oregon Health & Science University (OHSU), part of what used to be the Oregon Graduate Institute (OGI). Before that, he was in the Speech Algorithms Department at AT&T Labs – Research from 2001 to 2004. He received his PhD from the Department of Cognitive and Linguistic Sciences at Brown University in 2001.

Johns Hopkins University, Whiting School of Engineering

Center for Language and Speech Processing
Hackerman 226
3400 North Charles Street, Baltimore, MD 21218-2680