John Hershey (MERL) “Speech Separation by Deep Clustering: Towards Intelligent Audio Analysis and Understanding”
3400 N Charles St
Baltimore, MD 21218
We address the problem of acoustic source separation in a deep learning framework we call “deep clustering.” Deep learning has recently produced major improvements in speech enhancement tasks in which the speech and interference belong to distinct classes of signal. In this case, a deep network classifier labels time-frequency regions of the signal according to the class of the dominant source, and separation is achieved by reconstructing the corresponding regions. However, such classification-based approaches completely fail to learn in “cocktail party” scenarios, where the interference is also speech. We present an alternative method that generates relation-preserving embedding vectors, one for each time-frequency region of the spectrogram, such that their distances represent the graph structure of the desired solution. For speech separation, the graph defines the segmentation of the spectrogram into regions corresponding to each source, and its representation is decoded by clustering the embeddings. The embedded representation is thus flexible with respect to the number of clusters and is invariant to their permutations. This method can be compared to spectral clustering, which uses simple kernel features to represent high-rank affinities and decodes them using expensive spectral methods. Deep clustering instead uses powerful learned features to represent low-rank affinities that can be decoded using simple clustering methods. We present experiments showing speaker-independent separation of single-channel speech mixtures that yields an astounding 10 dB average improvement in SNR for both speech signals after training on 30 hours of speech data. Even more surprisingly, the same model trained only on two-speaker mixtures can separate three-speaker mixtures, indicating an unusual degree of generalization. An audio demonstration of the results will be given and future directions will be discussed.
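The low-rank affinity idea in the abstract can be illustrated with a small sketch. It is not the speaker's implementation: the Frobenius-norm objective, the toy embeddings, and the helper names affinity_loss and kmeans are assumptions made for illustration. V holds one embedding per time-frequency bin, and Y holds one-hot labels for the dominant source in each bin.

```python
import numpy as np

def affinity_loss(V, Y):
    """||V V^T - Y Y^T||_F^2, expanded so no N x N matrix is ever formed."""
    return (np.sum((V.T @ V) ** 2)
            - 2.0 * np.sum((V.T @ Y) ** 2)
            + np.sum((Y.T @ Y) ** 2))

def kmeans(V, k, iters=20):
    """Plain Lloyd's k-means with a deterministic farthest-point initialization."""
    centers = [V[0]]
    for _ in range(k - 1):
        # Distance from every point to its nearest already-chosen center.
        d = np.min([((V - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(V[d.argmax()])
    centers = np.array(centers)
    for _ in range(iters):
        dist = ((V[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = V[labels == j].mean(axis=0)
    return labels

# Toy data (hypothetical): 200 time-frequency bins, 2 sources, 3-D embeddings.
rng = np.random.default_rng(0)
src = rng.integers(0, 2, size=200)            # dominant source per bin
Y = np.eye(2)[src]                            # one-hot indicator matrix, N x 2
dirs = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
V = dirs[src] + 0.05 * rng.standard_normal((200, 3))
V /= np.linalg.norm(V, axis=1, keepdims=True)  # unit-norm embeddings

loss = affinity_loss(V, Y)
loss_permuted = affinity_loss(V, Y[:, ::-1])  # swap the two source labels
labels = kmeans(V, 2)                         # decode the partition by clustering
```

Because the objective depends on Y only through Y Y^T, relabeling the sources leaves it unchanged, and the cluster assignments recover the segmentation only up to a permutation of the source indices.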
Prior to joining MERL in 2010, John spent 5 years at IBM’s T.J. Watson Research Center in New York, where he led a team in noise-robust speech recognition. He also spent a year as a visiting researcher in the speech group at Microsoft Research, after obtaining his Ph.D. from UCSD in the area of multi-modal machine perception. He is currently working on machine learning for signal separation, speech recognition, language processing, and adaptive user interfaces.