Arabic Vowel Restoration – John Henderson (MITRE)

November 12, 2002

Arabic speech recognizers are frequently designed to produce output without short vowels, because readers of Arabic do not need the diacritics that mark them. This design also lets the recognizers use the millions of words of available non-diacritized Arabic text for language model training. The unwritten vowels are likewise left out of the pronunciation models, which forces the acoustic models to capture not only their intended targets, the non-short-vowel phonemes, but also the systematic interference of the unwritten short vowels.
I will detail data-driven approaches to Arabic vowel restoration explored during the 2002 Hopkins summer workshop and the effects they have on speech recognition systems for Arabic. Specifically, I will show that an Arabic ASR system trained on the output of an automatic vowel restoration system has a lower word error rate than an ASR system trained with implicit disregard for the unwritten portions of the words.

John Henderson received a B.S. in Math/CS from Carnegie Mellon University in 1994 and a Ph.D. from Johns Hopkins University in 2000, where he studied in the Natural Language Processing Laboratory. Since joining MITRE in 1999, he has worked on diverse topics such as designing annotation standards, named entity recognition, combining question-answering system outputs, recognizing variant forms of transliterated names, and out-of-vocabulary word repair for ASR systems. His current research includes machine translation of fixed point concepts such as proper names, times, and uniquely specified artifacts, evaluation of MT systems, and other topics that lie at the intersections of MT, NLP, and ASR.

Center for Language and Speech Processing