Far-Field Enhancement and Recognition in Mismatched Settings


Following the recent success of automatic speech recognition (ASR) in mobile applications, the noise robustness of ASR in real-world conditions has become an important technical issue. ASR systems will soon be expected to function in a variety of conditions: gaming (Kinect), personal assistants (Amazon Echo), meeting recognition, and distant wiretaps, to name a few. Traditional application scenarios tend to use the same microphone and channel conditions in training and at test time. Efforts so far have focused on developing techniques, such as microphone arrays, source separation, speech enhancement (SE), and ASR, that work in one specific setting. Here the setting consists of the configuration (i.e., the number of microphones and their geometry) and the environment (i.e., room noise and reverberation). Such approaches tend to over-fit the system to the training setting and do not generalize well to mismatched or unseen settings.

We propose tackling this challenging problem using cutting-edge machine learning techniques organized around three themes. The first theme is the embedding of generative model-based strategies into a deep learning framework using deep unfolding [Hershey et al., 2014]. Conventional generative model strategies such as adaptation [Gales, 1998], uncertainty decoding [Barker et al., 2005], and variational inference [Rennie et al., 2010; Watanabe et al., 2004] allow us to use physical problem constraints to guide an adaptation process, such as inferring the acoustic channel parameters, in order to generalize to new acoustic configurations. Nevertheless, we expect that such methods can be even more powerful when incorporated into a deep learning framework, in which the adaptation model itself can be discriminatively trained to produce more accurate estimates of the signals of interest. The second theme is the augmentation of training data based on existing databases to provide better coverage of unexpected conditions [Cui et al., 2014]. There are many factors of variation in far-field ASR, including noise types, microphone configuration, and room acoustics. By considering the acoustic and physiological constraints of the data generation, however, we can construct stochastic generative processes with few degrees of freedom from which we can efficiently sample multiple instances of training data, enabling multi-condition training. The third theme is the exploitation of multi-task learning methodologies [Seltzer & Droppo, 2013] for ASR and SE, since, in the context of deep networks, enhancement and ASR can be described with the same mathematical formalism [Mohamed et al., 2012].
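
As a rough illustration of the third theme, the following Python sketch shows one way a shared hidden layer could feed both an enhancement output and an ASR output, trained with a single weighted loss. The layer sizes, the parameter names (W_shared, W_se, W_asr), the mask-based enhancement target, and the weight alpha are all assumptions made for illustration; they are not taken from the cited papers.

```python
import numpy as np

def init_params(rng, n_in, n_hidden, n_freq, n_states):
    """Random weights for a shared layer and two task-specific heads (hypothetical sizes)."""
    scale = 0.1
    return {
        "W_shared": scale * rng.standard_normal((n_in, n_hidden)),
        "W_se": scale * rng.standard_normal((n_hidden, n_freq)),
        "W_asr": scale * rng.standard_normal((n_hidden, n_states)),
    }

def forward(params, x):
    """Return a time-frequency mask in (0, 1) and log state posteriors for each frame."""
    h = np.tanh(x @ params["W_shared"])                  # representation shared by both tasks
    mask = 1.0 / (1.0 + np.exp(-(h @ params["W_se"])))   # enhancement head (sigmoid)
    logits = h @ params["W_asr"]                         # ASR head
    logits = logits - logits.max(axis=1, keepdims=True)  # stabilise the softmax
    log_post = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return mask, log_post

def multitask_loss(params, x, target_mask, state_labels, alpha=0.5):
    """Weighted sum of an enhancement loss (MSE) and an ASR loss (cross-entropy)."""
    mask, log_post = forward(params, x)
    se_loss = np.mean((mask - target_mask) ** 2)
    asr_loss = -np.mean(log_post[np.arange(len(state_labels)), state_labels])
    return alpha * se_loss + (1.0 - alpha) * asr_loss, se_loss, asr_loss

# Toy data: 5 frames of 40-dimensional features, 10 acoustic states.
rng = np.random.default_rng(0)
params = init_params(rng, n_in=40, n_hidden=32, n_freq=40, n_states=10)
x = rng.standard_normal((5, 40))
target_mask = rng.random((5, 40))
state_labels = rng.integers(0, 10, size=5)
total, se_loss, asr_loss = multitask_loss(params, x, target_mask, state_labels, alpha=0.3)
```

Only the forward pass and the loss are shown. In practice the gradients of the combined loss would update the shared weights from both tasks, which is the point of the multi-task setup.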


I. Probabilistic model-based methods, for example: 1) self-calibrating microphone arrays using ASR models; 2) model-based dereverberation (e.g., [Nakatani et al., 2011]); and 3) model-based speech enhancement, including non-negative matrix factorization (NMF) and its generalizations (e.g., multichannel NMF [Ozerov & Fevotte, 2010]). These methods may be loosely integrated with ASR via lattice-based methods [Mandel & Narayanan, 2014; Carmona et al., 2013] or, where possible, tightly integrated inside the speech decoder.

II. Data augmentation. We will exploit several data augmentation techniques for deep networks that cover speaker variation through linear transformations or vocal tract length mapping between speakers [Cui et al., 2014], and extend these ideas to multiple configurations and environments to generate an augmented training set that improves the generalization of the models.

III. Deep network methods for integrating ASR acoustic modeling and enhancement in a multi-task learning framework, for example, long short-term memory (LSTM) recurrent neural networks (RNNs), bi-directional LSTMs, convolutional networks, pooling across microphones, and pooling across beam directions.

IV. Deep unfolding of model-based methods, a hybrid of I. and III. For example, we can derive a novel deep network architecture (as in III.) whose layers emulate the computations performed in the iterations of a variational algorithm for model-based noise and reverberation compensation (as in I.), using the framework in [Hershey et al., 2014].
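
To make the idea in IV. concrete, here is a minimal Python sketch of unfolding applied to NMF rather than to a variational algorithm; NMF is swapped in only because its iteration is short. Each "layer" performs one standard multiplicative update of the activations H for the Euclidean cost, and each layer holds its own copy of the basis matrix so that the copies could later be untied and trained discriminatively. The function names, matrix sizes, and number of layers are assumptions for illustration, and only the forward pass is shown.

```python
import numpy as np

def nmf_update(H, W, V, eps=1e-8):
    """One multiplicative update of the activations H for V ~ W @ H (Euclidean cost)."""
    return H * (W.T @ V) / (W.T @ W @ H + eps)

def unfolded_nmf(V, layer_bases, eps=1e-8):
    """Forward pass of the unfolded network: layer k applies one update using its own W_k."""
    rank = layer_bases[0].shape[1]
    H = np.ones((rank, V.shape[1]))   # fixed non-negative initialisation
    for W_k in layer_bases:
        H = nmf_update(H, W_k, V, eps)
    return H

# Toy non-negative "spectrogram" generated from a known basis.
rng = np.random.default_rng(0)
W = rng.random((20, 4))
V = W @ rng.random((4, 30))

# Ten layers, all initialised to the same basis. With tied weights the network
# reproduces ten iterations of the original algorithm exactly.
layers = [W.copy() for _ in range(10)]
H = unfolded_nmf(V, layers)
reconstruction_error = np.linalg.norm(V - W @ H)
```

With tied weights this is just the iterative algorithm written as a feed-forward computation. The potential gain comes from untying the per-layer parameters and training them by back-propagation against a task-specific objective, which this sketch does not do.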

Task design

Speech data will be embedded in continuous audio backgrounds with natural context and continuity constraints. We will combine existing distant-talk/noise-robust ASR tasks, having different configurations and environments, based on the CHiME series [Barker et al., 2013] (including new six-channel CHiME-3 data), AMI [Hain et al., 2006], REVERB [Kinoshita et al., 2013], and ASpIRE databases. We will also prepare augmented training data based on the data augmentation techniques described in II. Additional data can be recorded if necessary, for example, using instrumented meeting rooms and mobile mic arrays.
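
One simple way to sample augmented far-field training data is sketched below in Python: draw a reverberation time and a signal-to-noise ratio at random, convolve the clean signal with a synthetic room impulse response, and add noise at the sampled level. The impulse response model (exponentially decaying white noise), the parameter ranges, and all function names are assumptions for illustration; measured impulse responses and recorded noise would be used in practice.

```python
import numpy as np

def synthetic_rir(rng, fs=16000, t60=0.4, length_s=0.5):
    """Crude room impulse response: white noise with an exponential decay (an assumption)."""
    n = int(fs * length_s)
    t = np.arange(n) / fs
    decay = np.exp(-3.0 * np.log(10.0) * t / t60)  # amplitude falls by 60 dB after t60 seconds
    rir = rng.standard_normal(n) * decay
    rir[0] = 1.0                                   # direct path
    return rir / np.linalg.norm(rir)

def mix_at_snr(speech, noise, snr_db):
    """Scale the noise so that the speech-to-noise power ratio equals snr_db, then add it."""
    noise = noise[: len(speech)]                   # noise must be at least as long as speech
    gain = np.sqrt(np.mean(speech ** 2) / (np.mean(noise ** 2) * 10.0 ** (snr_db / 10.0)))
    return speech + gain * noise

def augment(clean, noise, rng, snr_db_range=(0.0, 20.0), t60_range=(0.2, 0.8)):
    """Sample one reverberant, noisy version of a clean signal and report the sampled condition."""
    t60 = rng.uniform(*t60_range)
    snr_db = rng.uniform(*snr_db_range)
    reverberant = np.convolve(clean, synthetic_rir(rng, t60=t60))[: len(clean)]
    return mix_at_snr(reverberant, noise, snr_db), {"t60": t60, "snr_db": snr_db}

# Random signals stand in for half a second of speech and noise at 16 kHz.
rng = np.random.default_rng(0)
clean = rng.standard_normal(8000)
noise = rng.standard_normal(8000)
noisy, condition = augment(clean, noise, rng)
```

Calling augment repeatedly on the same clean utterance yields multiple training instances under different conditions, using only two sampled degrees of freedom per instance.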

Software platform and outcome

We will assemble a publicly available state-of-the-art ASR baseline connected to several state-of-the-art SE techniques. The SE techniques will include conventional tools such as beamforming, dereverberation, and echo cancellation, and advanced tools such as non-negative matrix factorization, spectral mask estimation, and RNN-based speech enhancement. ASR tools include Kaldi [Povey et al., 2011] for core training and decoding based on tandem bottleneck features [Grezl et al., 2007] and DNN acoustic modeling with sequence training [Vesely et al., 2013]. The MSR Computational Network Toolkit (CNTK) [Yu et al., 2014] and Theano [Bergstra et al., 2010] can be used for novel architectures, including deep unfolding. The outcome of the project will include a far-field speech recognition toolkit and software for data augmentation.
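
As an example of the conventional tools mentioned above, the following Python sketch implements a basic delay-and-sum beamformer. It assumes the per-microphone delays of the source are already known and are whole numbers of samples; estimating the delays, fractional delays, and the function name delay_and_sum are outside what the text specifies and are illustrative assumptions only.

```python
import numpy as np

def delay_and_sum(channels, delays):
    """Align each channel by its known integer delay (in samples) and average.

    channels: array of shape (n_mics, n_samples)
    delays:   arrival delay of the source at each microphone, in samples
    Only the first n_samples - max(delays) output samples are fully averaged.
    """
    n_mics, n = channels.shape
    out = np.zeros(n)
    for x, d in zip(channels, delays):
        out[: n - d] += x[d:]          # advance the channel by d samples to undo its delay
    return out / n_mics

# Simulate three microphones: delayed copies of one source plus independent noise.
rng = np.random.default_rng(0)
n = 4000
source = rng.standard_normal(n)
delays = [0, 3, 5]
channels = np.zeros((3, n))
for m, d in enumerate(delays):
    channels[m, d:] = source[: n - d]
    channels[m] += 0.5 * rng.standard_normal(n)

enhanced = delay_and_sum(channels, delays)
```

Because the source adds coherently across the aligned channels while the independent noise does not, the averaged output is closer to the source than any single microphone signal.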


Team Members
Team Leader
John Hershey, Mitsubishi Electric Research Laboratory
Senior Members
Jon Barker, Sheffield University
Martin Karafiat, Brno University of Technology
Michael Mandel, Ohio State University
Shinji Watanabe, Mitsubishi Electric Research Laboratory
Graduate Students
Vijay Peddinti, Johns Hopkins University
Pawel Swietojanski, Edinburgh University
Karel Vesely, Brno University of Technology

Center for Language and Speech Processing