Finding Acoustic Regularities in Speech: From Words to Segments – Jim Glass (MIT)
The development of an automatic speech recognizer is typically a highly supervised process involving the specification of phonetic inventories, lexicons, acoustic and language models, and annotated training corpora. Although some model parameters may be modified via adaptation, the overall structure of the recognizer remains relatively static thereafter. While this approach has been effective for problems where adequate human expertise and labeled corpora exist, it is challenged by less-supervised or unsupervised scenarios. It also stands in stark contrast to human processing of speech and language, where learning is an intrinsic capability.

From a machine learning perspective, a complementary alternative is to discover unit inventories in an unsupervised manner by exploiting the structure of repeating acoustic patterns within the speech signal. In this work we use pattern discovery methods to automatically acquire lexical entities, as well as speaker and topic segmentations, directly from an untranscribed audio stream. Our approach to unsupervised word acquisition utilizes a segmental variant of dynamic time warping, a widely used dynamic programming technique, which allows us to find matching acoustic patterns between spoken utterances. By aggregating information about these matching patterns across audio streams, we demonstrate how to group similar acoustic sequences together to form clusters corresponding to lexical entities such as words and short multi-word phrases.

On a corpus of lecture material, we demonstrate that clusters found using this technique exhibit high purity and that many of the corresponding lexical entities are relevant to the underlying audio stream. We have applied the acoustic pattern matching and clustering methods to several important problems in speech and language processing. In addition to showing how this methodology applies across different languages, we demonstrate two methods to automatically determine the identity of speech clusters.
Finally, we also show how it can be used to provide an unsupervised segmentation of speakers and topics. Joint work with Alex Park, Igor Malioutov, and Regina Barzilay.
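To make the core idea concrete, the following is a minimal sketch of the kind of dynamic-programming alignment underlying this style of acoustic pattern matching. It implements plain dynamic time warping between two sequences of feature vectors; the feature values, the Euclidean local distance, and the `dtw_distance` function are illustrative assumptions, not the authors' actual segmental DTW implementation, which searches for well-matching subsequences rather than aligning whole utterances.

```python
import math

def dtw_distance(a, b):
    """Align two sequences of feature vectors (e.g. per-frame acoustic
    features) and return the cumulative alignment cost; lower cost
    means the sequences contain more similar acoustic patterns."""
    n, m = len(a), len(b)
    INF = float("inf")
    # cost[i][j] = best cumulative cost of aligning a[:i] with b[:j]
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(a[i - 1], b[j - 1])  # local Euclidean distance
            cost[i][j] = d + min(cost[i - 1][j],      # step in a only
                                 cost[i][j - 1],      # step in b only
                                 cost[i - 1][j - 1])  # step in both
    return cost[n][m]

# A time-warped repetition of a pattern scores much better than an
# unrelated sequence, which is what lets repeated words be discovered.
a = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
b = [(0.0, 0.0), (1.0, 1.0), (1.0, 1.0), (2.0, 2.0)]  # warped copy of a
c = [(5.0, 5.0), (6.0, 6.0), (7.0, 7.0)]              # unrelated
print(dtw_distance(a, b) < dtw_distance(a, c))
```

In the full system, pairwise alignment scores like these would be aggregated across many utterance pairs, and low-cost matching fragments grouped into clusters that behave like word or phrase candidates.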
James R. Glass obtained his S.M. and Ph.D. degrees in Electrical Engineering and Computer Science from the Massachusetts Institute of Technology. He is currently a Principal Research Scientist at the MIT Computer Science and Artificial Intelligence Laboratory where he heads the Spoken Language Systems Group. He is also a Lecturer in the Harvard-MIT Division of Health Sciences and Technology. His primary research interests are in the area of speech communication and human-computer interaction, centered on automatic speech recognition and spoken language understanding.