Single-Channel Mixed Speech Recognition Using Deep Neural Networks – Dong Yu (Microsoft Research)
Baltimore, MD 21218
While significant progress has been made in improving the noise robustness of speech recognition systems, recognizing speech in the presence of a competing talker remains one of the most challenging unsolved problems in the field. In this talk, I will present our first attempt at attacking this problem using deep neural networks (DNNs). Our approach adopts a multi-style training strategy using artificially mixed speech data. I will discuss the strengths and weaknesses of the several setups we have investigated, including a WFST-based two-talker decoder that works with the trained DNNs. Experiments on the 2006 speech separation and recognition challenge task demonstrate that the proposed DNN-based system is remarkably robust to the interference of a competing speaker. The best of our proposed setups achieves an overall WER of 18.8%, which improves upon the result obtained by the state-of-the-art IBM superhuman system by 2.8% absolute, while making fewer assumptions.
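As a rough illustration of what "artificially mixed speech data" can mean, the sketch below scales an interfering signal so that the target-to-masker energy ratio hits a chosen value in dB, then sums the two signals. This is not the speaker's actual data pipeline: the function name `mix_at_tmr`, the use of random noise in place of real utterances, and the -3 dB ratio are all assumptions made for illustration.

```python
import math
import random


def energy(signal):
    """Sum of squared samples."""
    return sum(x * x for x in signal)


def mix_at_tmr(target, masker, tmr_db):
    """Hypothetical helper: mix two signals at a target-to-masker ratio (dB).

    The masker is rescaled so that
    10 * log10(energy(target) / energy(scaled_masker)) == tmr_db.
    Returns the mixture and the rescaled masker.
    """
    n = min(len(target), len(masker))  # truncate to the shorter signal
    target, masker = target[:n], masker[:n]
    e_t, e_m = energy(target), energy(masker)
    if e_t == 0 or e_m == 0:
        raise ValueError("both signals must have non-zero energy")
    gain = math.sqrt(e_t / (e_m * 10 ** (tmr_db / 10)))
    scaled = [gain * x for x in masker]
    mixed = [t + m for t, m in zip(target, scaled)]
    return mixed, scaled


# Stand-in "utterances": uniform noise, one second at an assumed 16 kHz rate.
rng = random.Random(0)
target = [rng.uniform(-1, 1) for _ in range(16000)]
masker = [rng.uniform(-1, 1) for _ in range(16000)]

mixed, scaled = mix_at_tmr(target, masker, tmr_db=-3.0)
achieved = 10 * math.log10(energy(target) / energy(scaled))
print(round(achieved, 2))
```

Repeating such a mix over many utterance pairs and several ratios would give a multi-condition training set in the spirit of the multi-style strategy described above.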
Dr. Dong Yu is a principal researcher in the Microsoft speech and dialog research group. His research interests include speech processing, robust speech recognition, discriminative training, and machine learning. He has published over 130 papers in these areas and is the co-inventor of more than 50 granted or pending patents. His recent work on the context-dependent deep neural network hidden Markov model (CD-DNN-HMM) has been shaping the direction of research on large-vocabulary speech recognition and was recognized with the IEEE SPS 2013 best paper award.