Brian Roark (Google AI): Romanization, non-standard orthography and text entry
Abstract
In this talk, we present issues in natural language modeling for text entry in languages that use noisy (i.e., non-standard) romanization strategies, with a particular focus on languages using Indic scripts. We discuss romanization strategies and present data indicating that this sort of romanization typically amounts to a rough phonetic transcription. We describe Gboard keyboards that rely on models very similar to widely used grapheme-to-phoneme models. We also discuss language modeling of romanized text directly.
Bio
Brian Roark is a computational linguist working on various topics in natural language processing. His research interests include syntactic parsing of text and speech; language modeling for automatic speech recognition and other applications; supervised and unsupervised learning of language and parsing models; and text entry, accessibility, and augmentative and alternative communication (AAC).
Before joining Google as a research scientist in 2013, he was a faculty member for nine years in the Center for Spoken Language Understanding (CSLU) at Oregon Health & Science University (OHSU), part of what used to be the Oregon Graduate Institute (OGI). Before that, he was in the Speech Algorithms Department at AT&T Labs – Research from 2001 to 2004. He received his PhD from the Department of Cognitive and Linguistic Sciences at Brown University in 2001.