Rick Rose (Google) “Multimodal audio-visual speech processing at Google”
Abstract: The increased availability of high-resolution cameras and array microphones in live meetings, video production, and camera-enabled assistant devices has created opportunities for exploiting multiple modalities in speech applications. This presentation summarizes initial work at Google on fusing audio and visual information to improve the performance of speech recognition and speaker tracking. We show that multimodal approaches provide significant improvements in both speech recognition and speaker diarization, especially under noisy conditions. However, these gains are not always robust to missing modalities, and considerable work remains to make audio-visual speech processing practical. Results from our initial multimodal ASR and speaker diarization experiments will be presented.
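The abstract does not describe the fusion architecture used in these experiments. As a purely illustrative sketch, the snippet below shows one common baseline for audio-visual fusion: early (feature-level) fusion, where time-aligned audio and visual features are concatenated frame by frame before being passed to a recognizer. All names, feature dimensions, and the zero-vector fallback for a dropped video stream are assumptions for illustration, not the method presented in the talk.

```python
import numpy as np

def early_fusion(audio_feats: np.ndarray, visual_feats: np.ndarray) -> np.ndarray:
    """Concatenate time-aligned audio and visual features per frame.

    audio_feats:  (T, D_a) array, e.g. log-mel filterbank features per frame.
    visual_feats: (T, D_v) array, e.g. lip-region embeddings per video frame,
                  upsampled to the audio frame rate.
    Returns:      (T, D_a + D_v) fused feature matrix.
    """
    assert audio_feats.shape[0] == visual_feats.shape[0], "streams must be time-aligned"
    return np.concatenate([audio_feats, visual_feats], axis=1)

def fuse_with_fallback(audio_feats: np.ndarray,
                       visual_feats: np.ndarray | None = None,
                       d_v: int = 512) -> np.ndarray:
    """Handle the missing-modality case the abstract mentions.

    If the video stream is absent, substitute a placeholder (here a zero
    vector of hypothetical dimension d_v) so the downstream model still
    receives a fixed-width input.
    """
    if visual_feats is None:
        visual_feats = np.zeros((audio_feats.shape[0], d_v))
    return early_fusion(audio_feats, visual_feats)

# Example: 100 frames of 80-dim audio features, with and without video.
T = 100
fused = fuse_with_fallback(np.random.randn(T, 80), np.random.randn(T, 512))
audio_only = fuse_with_fallback(np.random.randn(T, 80))
```

One point the abstract's robustness caveat suggests: a model trained only on fused inputs may degrade badly when the visual stream is replaced by a placeholder at test time, which is part of what makes missing modalities a practical challenge.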
Bio: Rick Rose has been a research scientist at Google in New York City since October 2014. At Google he has contributed to efforts in far-field speech recognition, acoustic modeling for ASR, speaker diarization, and audio-visual speech processing. Before joining Google, he was a Professor of Electrical and Computer Engineering at McGill University in Montreal from 2004, a member of the research staff at AT&T Labs / Bell Labs, and a member of staff at MIT Lincoln Labs. He received his PhD in Electrical Engineering from the Georgia Institute of Technology. He has been active in the IEEE Signal Processing Society and is an IEEE Fellow.