Geometric and Event-Based Approaches to Speech Representation and Recognition – Aren Jansen (University of Illinois)

August 29, 2009 all-day

Anyone who has used an automatic speech recognition (ASR) system, either on a customer support line or on their own personal computer, knows firsthand that there is vast room for improvement. While state-of-the-art commercial systems perform very well in near-ideal environments, system robustness remains far below human levels. The prevailing hidden Markov model (HMM) based paradigm will undoubtedly see gains in future decades as increased computing capacity admits more complex acoustic models that encompass a range of acoustic environments. In the meantime, there is a wealth of scientific understanding of production and perceptual mechanisms that has yet to be fully exploited by engineers and technologists. In this talk, I will present the main results of a research program that takes scientific inspiration from linguistics, speech perception, and neuroscience as starting points for alternative directions in automatic speech recognition. First, I consider the implications that speech production has for the geometric structure of speech sounds and the role this perspective can play in speech technology. Second, I consider the hypothesis that the linguistic content underlying human speech may be more efficiently and robustly coded in the pattern of timings of various acoustic events (landmarks) present in the speech signal. I will present a point-process-based statistical framework for phonetic recognition and keyword spotting that matches the performance of equivalent frame-based systems. This approach suggests a new unsupervised adaptation strategy for improving recognizer robustness that outperforms maximum likelihood linear regression adaptation of a continuous density keyword-filler HMM system.
Aren Jansen has accepted a position as Senior Research Scientist at the Center of Excellence in Human Language Technology at JHU and is a candidate for a position as Research Assistant Professor in the ECE department at JHU. He received the B.A. degree in physics from Cornell University in 2001. He received the M.S. degree in physics as well as the M.S. and Ph.D. degrees in computer science from the University of Chicago in 2003, 2005, and 2008, respectively, and has undertaken postdoctoral work at the University of Chicago. His research centers on exploring the interface of knowledge-based and statistical approaches to speech representation and recognition.

Center for Language and Speech Processing