New Methods to Capture and Exploit Multiscale Speech Dynamics: From Mathematical Models to Forensic Tools
Patrick Wolfe, Statistics and Information Sciences Laboratory (SISL), Harvard University
October 9, 2007
The variability inherent in speech waveforms gives rise to powerful temporal and spectral dynamics that evolve across multiple scales, and in this talk we describe new methods to capture and exploit these multiscale dynamics. First we consider the canonical task of formant estimation, formulated as a statistical model-based tracking problem. We extend a recent model of Deng et al. both to account for the uncertainty of speech presence by way of a censored likelihood formulation and to explicitly model formant cross-correlation via a vector autoregression. Our results indicate an improvement of 20-30% relative to benchmark formant analysis tools. In the second part of the talk we present a new adaptive short-time Fourier analysis-synthesis scheme for signal analysis, and demonstrate its efficacy in speech enhancement. While a number of adaptive analyses have previously been proposed to overcome the limitations of fixed time-frequency resolution schemes, we derive here a modified overlap-add procedure that enables efficient resynthesis of the speech waveform. Measurements and listening tests alike indicate the potential of this approach to yield a clear improvement over fixed-resolution enhancement systems currently used in practice.
Patrick J. Wolfe is currently Assistant Professor of Electrical Engineering in the School of Engineering and Applied Sciences at Harvard, with appointments in the Department of Statistics and the Harvard-MIT Program in Speech and Hearing Biosciences and Technology. He received a B.S. in Electrical Engineering and a B.Mus. concurrently from the University of Illinois at Urbana-Champaign in 1998, both with honors. He then earned his Ph.D. in Engineering from the University of Cambridge (UK) as an NSF Graduate Research Fellow, working on the application of perceptual criteria to statistical audio signal processing. Prior to founding the Statistics and Information Sciences Laboratory at Harvard in 2004, Professor Wolfe held a Fellowship and College Lectureship jointly in Engineering and Computer Science at New Hall, a constituent college of the University of Cambridge, where he also served as Dean. He has also taught in the Department of Statistical Science at University College London, and continues to act as a consultant to the professional audio community in government and industry. At Harvard he teaches a variety of courses on advanced topics in inference, information, and statistical signal processing, as well as applied mathematics and statistics at the undergraduate level. In addition to his diverse teaching activities, Professor Wolfe has published in the literatures of engineering, computer science, and statistics, and has received honors from the IEEE, the Acoustical Society of America, and the International Society for Bayesian Analysis. His research group focuses on statistical signal processing for modern high-dimensional data sets such as speech waveforms and color images, and is supported by a number of grants and partnerships, including sponsored projects with NSF, DARPA, and Sony Electronics, Inc.
Recent research highlights include a paper award at the 2007 IEEE International Conference on Image Processing for work in color image acquisition, a new approach to speech formant tracking that yields up to 30% improvement relative to benchmark methods, and a set of matrix approximation techniques for spectral methods in machine learning, with error bounds that improve significantly upon known results.