Automatic Speech Processing by Inference in Generative Models – Sam Roweis (University of Toronto)

March 9, 2004 all-day

Say you want to perform some complex speech processing task. How should you develop the algorithm that you eventually use? Traditionally, you combine inspiration, careful examination of previous work, and arduous trial-and-error to invent a sequence of operations to apply to the waveform. But there is another approach: dream up a “generative model” (a probabilistic machine which outputs data in the same form as your data) in which the key quantities that you would eventually like to compute appear as hidden (latent) variables. Now perform inference in this model, estimating the hidden quantities. In doing so, the rules of probability will derive for you, automatically, a signal processing algorithm. While inference is well known to the speech community as a decoding step for HMMs, exactly the same type of calculation can be performed in many other models not related to recognition. In this talk, I will give several examples of this paradigm, showing how inference in very simple models can be used to perform surprisingly complex speech processing tasks, including denoising, source separation, pitch tracking, timescale modification, and estimation of articulatory movements from audio. In particular, I will introduce the factorial-max vector quantization (MAXVQ) model, motivated by the astonishing max approximation to log spectrograms of mixtures, and show that it can be used with an efficient branch-and-bound technique for exact inference to perform both additive denoising and monaural separation. I will also describe a purely time-domain approach to pitch processing which identifies waveform samples at the boundaries between glottal pulse periods (in voiced speech) or at the boundaries between unvoiced segments. An efficient algorithm for inferring these boundaries is derived from a simple probabilistic generative model for segments, and it gives excellent results on pitch tracking, voiced/unvoiced detection, and timescale modification.
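To make the max approximation concrete, the following is a minimal Python sketch, not the method presented in the talk. It rests on the identity log(e^a + e^b) = max(a, b) + log(1 + e^(-|a - b|)), so the log of a sum of two powers exceeds the larger log power by at most log 2. The toy codebooks, the function names (`log_sum`, `max_approx_error`, `infer_pair`), and the squared-error score are all assumptions made for illustration, and a brute-force search over codeword pairs stands in for the branch-and-bound technique mentioned in the abstract.

```python
import itertools
import math


def log_sum(la, lb):
    """Exact log(exp(la) + exp(lb)), computed in a numerically stable way."""
    hi, lo = max(la, lb), min(la, lb)
    return hi + math.log1p(math.exp(lo - hi))


def max_approx_error(la, lb):
    """Gap between the exact log-sum and the max approximation (0 < gap <= log 2)."""
    return log_sum(la, lb) - max(la, lb)


def infer_pair(x, codebook_a, codebook_b):
    """Brute-force search for the codeword pair (i, j) whose elementwise max
    is closest to the observed log spectrum x in squared error.
    (Hypothetical stand-in for exact branch-and-bound inference.)"""
    best = None
    pairs = itertools.product(range(len(codebook_a)), range(len(codebook_b)))
    for i, j in pairs:
        pred = [max(a, b) for a, b in zip(codebook_a[i], codebook_b[j])]
        err = sum((p - v) ** 2 for p, v in zip(pred, x))
        if best is None or err < best[0]:
            best = (err, i, j)
    return best[1], best[2]


# Toy log-spectral codebooks for two sources (made-up numbers, 4 frequency bins).
codebook_a = [[5.0, 1.0, 2.0, 0.0], [0.0, 0.0, 4.0, 1.0]]
codebook_b = [[0.0, 3.0, 0.5, 6.0], [2.0, 0.0, 0.0, 0.0]]

# Simulate one mixture frame: powers add, so log powers combine via log_sum.
x = [log_sum(a, b) for a, b in zip(codebook_a[0], codebook_b[0])]

# Infer which codeword each source used, then mark the bins where source A dominates.
pair = infer_pair(x, codebook_a, codebook_b)
mask_a = [a > b for a, b in zip(codebook_a[pair[0]], codebook_b[pair[1]])]
print(pair, mask_a)
```

In this toy setting the inferred pair yields a binary mask over frequency bins, which is one way such a model could assign each bin of a mixture to a single source.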

Center for Language and Speech Processing