Fadi Biadsy and Dimitri Kanevsky (Google) “Speech Recognition: From Speaker Dependent to Speaker Independent to Full Personalization” and “Parrotron: A Unified E2E Speech-to-Speech Conversion and ASR Model for Atypical Speech”
3400 N. Charles Street
Baltimore
MD 21218
Abstract
Most people take for granted that when they speak, they will be heard and understood. But for the millions who live with speech impairments caused by physical or neurological conditions, trying to communicate with others can be difficult and lead to frustration. While there have been a great number of recent advances in Automatic Speech Recognition (ASR) technologies, these interfaces can be inaccessible for those with speech impairments.
In this talk, we will present Parrotron, an end-to-end-trained speech-to-speech conversion model that maps an input spectrogram directly to another spectrogram, without using any intermediate discrete representation. The system is also trained to emit words in parallel with the spectrogram. We demonstrate that this model can be trained to normalize speech from any speaker, regardless of accent, prosody, and background noise, into the voice of a single canonical target speaker with a fixed accent and consistent articulation and prosody. We further show that this normalization model can be adapted to normalize highly atypical speech from speakers with a variety of speech impairments (due to ALS, cerebral palsy, deafness, stroke, brain injury, etc.), resulting in significant improvements in intelligibility and naturalness, as measured by a speech recognizer and listening tests. Finally, to demonstrate the utility of this model on other speech tasks, we show that the same model architecture can be trained to perform a speech separation task.
Dimitri will briefly describe some key moments in the development of speech recognition algorithms that he was involved in, and their applications to YouTube closed captions, Live Transcribe, and wearable subtitles.
Fadi will then speak about the development of Parrotron.
Biographies
Dimitri Kanevsky started his career at Google working on speech recognition algorithms. Prior to joining Google, Dimitri was a Research Staff Member in the Speech Algorithms Department at IBM. Prior to IBM, he worked at a number of centers for higher mathematics, including the Max Planck Institute in Germany and the Institute for Advanced Study in Princeton. He currently holds 295 US patents and was a Master Inventor at IBM. MIT Technology Review recognized Dimitri's conversational-biometrics-based security patent as one of the five most influential patents of 2003. In 2012, Dimitri was honored at the White House as a Champion of Change for his efforts to advance access to science, technology, engineering, and math.
Fadi Biadsy has been a senior staff research scientist at Google NY for the past ten years. He has explored and led multiple projects at Google, including speech recognition, speech conversion, language modeling, and semantic understanding. He received his PhD from Columbia University in 2011. At Columbia, he worked on a variety of speech and language processing projects, including dialect and accent recognition, speech recognition, charismatic speech, and question answering. He holds a BSc and an MSc in mathematics and computer science. He worked on handwriting recognition during his master's degree, and he worked as a senior software developer for five years at Dalet Digital Media Systems, building multimedia broadcasting systems.