Tara Sainath (Google Research) “End-to-End Modeling for Speech Recognition”

February 5, 2018 @ 12:00 pm – 1:15 pm
Hackerman Hall B17
3400 N Charles St
Baltimore, MD 21218


Traditional automatic speech recognition (ASR) systems consist of an acoustic model (AM), a pronunciation model (PM) and a language model (LM), all of which are trained independently on different datasets and are often manually designed. Over the last several years, end-to-end systems, which attempt to learn these separate components jointly as a single system, have grown in popularity. While these end-to-end models have shown promising results in the literature, it is not yet clear whether such approaches can improve on current state-of-the-art conventional systems. In this talk, I will discuss various algorithmic and systematic improvements we have explored in developing a new end-to-end model that surpasses the performance of a conventional production system. I will also discuss promising results with multi-lingual and multi-dialect end-to-end models. Finally, I will discuss current challenges with these models and future research directions.


Tara Sainath received her PhD in Electrical Engineering and Computer Science from MIT in 2009. The main focus of her PhD work was acoustic modeling for noise-robust speech recognition. After her PhD, she spent 5 years in the Speech and Language Algorithms group at IBM T.J. Watson Research Center before joining Google Research. She served as a Program Chair for ICLR in 2017 and 2018. She has also co-organized numerous special sessions and workshops, including at Interspeech 2010, ICML 2013, Interspeech 2016 and ICML 2017. In addition, she is a member of the IEEE Speech and Language Processing Technical Committee (SLTC) and an Associate Editor for IEEE/ACM Transactions on Audio, Speech, and Language Processing. Her research interests are mainly in acoustic modeling, including deep neural networks, sparse representations and adaptation methods.

Center for Language and Speech Processing