New Waves of Innovation in Large-Scale Speech Technology Ignited by Deep Learning – Li Deng (Microsoft Research)

October 23, 2012 all-day

Semantic information embedded in the speech signal manifests itself in a dynamic process rooted in the deep linguistic hierarchy, an intrinsic part of the human cognitive system. Modeling both the dynamic process and the deep structure to advance speech technology has been an active pursuit for more than 20 years, but only within the past two years has a technological breakthrough been achieved by the methodology commonly referred to as “deep learning”. Deep Belief Nets (DBNs) and related deep neural networks have recently been used to supersede the Gaussian mixture model component in HMM-based speech recognition, and have produced dramatic error-rate reductions in both phone recognition and large-vocabulary speech recognition at industry scale while keeping the HMM component intact. Separately, (constrained) Dynamic Bayesian Networks have been developed over many years to improve dynamic models of speech, aiming to overcome the IID assumption that is a key weakness of the HMM, through a set of techniques commonly known as hidden dynamic/trajectory models or articulatory-like segmental representations. The history of these two largely separate lines of research will be critically reviewed and analyzed in the context of modeling the deep, dynamic linguistic hierarchy to advance speech recognition technology. The first wave of innovation has successfully unseated the Gaussian mixture model and MFCC-like features, two of the three main pillars of the 20-year-old speech recognition technology. Future directions will be discussed and analyzed for supplanting the final pillar, the HMM, where frame-level scores are to be enhanced to dynamic-segment scores through new waves of innovation capitalizing on multiple lines of research that have enriched our knowledge of the deep, dynamic process of human speech.
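The hybrid setup described above keeps the HMM decoder intact and replaces only the Gaussian-mixture emission model with a neural network. The standard plumbing is to turn the network's state posteriors into scaled likelihoods by dividing out the state priors before Viterbi decoding. A minimal sketch of that conversion (the function and variable names are illustrative, not from the talk):

```python
import math

def frame_scores(log_posteriors, log_priors):
    """Convert DNN state posteriors into HMM emission scores.

    In the hybrid DNN-HMM recipe the network outputs P(state | frame),
    while Viterbi decoding needs p(frame | state). By Bayes' rule the
    two differ by the state prior (and a constant), so in the log
    domain each posterior simply has its prior subtracted:
        log p(x | s) = log P(s | x) - log P(s) + const.
    """
    return [lp - pr for lp, pr in zip(log_posteriors, log_priors)]

# Toy example with three HMM states, assuming uniform priors.
posts = [math.log(0.7), math.log(0.2), math.log(0.1)]
priors = [math.log(1.0 / 3)] * 3
scores = frame_scores(posts, priors)
```

With uniform priors the ranking of states is unchanged; with the empirical priors estimated from training alignments, frequent states are penalized so that the scaled likelihoods plug into the existing HMM machinery exactly where the Gaussian-mixture scores used to go.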
Li Deng received his Ph.D. from the University of Wisconsin-Madison. He was an Assistant Professor (1989-1992), Associate Professor (1992-1996), and Full Professor (1996-1999) at the University of Waterloo, Ontario, Canada. He then joined Microsoft Research, Redmond, where he is currently a Principal Researcher and where he received the Microsoft Research Technology Transfer, Goldstar, and Achievement Awards. Prior to MSR, he also worked or taught at the Massachusetts Institute of Technology, ATR Interpreting Telecom. Research Lab. (Kyoto, Japan), and HKUST. He has published over 300 refereed papers in leading journals and conferences and 3 books covering broad areas of human language technology and machine learning. He is a Fellow of the Acoustical Society of America, a Fellow of the IEEE, and a Fellow of the International Speech Communication Association. He is an inventor or co-inventor of over 50 granted US, Japanese, or international patents. Recently, he served as Editor-in-Chief of the IEEE Signal Processing Magazine (2009-2011), which ranked first in 2010 and 2011 among all 247 publications in the Electrical and Electronics Engineering category worldwide in terms of impact factor, and for which he received the 2011 IEEE SPS Meritorious Service Award. He currently serves as Editor-in-Chief of the IEEE Transactions on Audio, Speech, and Language Processing. His technical work over the past three years brought the power of deep learning into the speech recognition and signal processing fields.

Center for Language and Speech Processing