Probabilistic Models for Clustering Natural Language Data

Dr. John Lafferty of the School of Computer Science, Carnegie Mellon University
at the CLSP/JHU Summer Research Workshop, August 5, 1998, 10:30 am, Arellano Theater, Levering Hall.

Clustering algorithms are important for a wide variety of tasks in information retrieval, natural language processing, and speech processing. We will present a new language modeling approach to unsupervised clustering of words, constituents, and documents. Our probabilistic method offers advantages over more algebraic techniques such as latent semantic analysis, and improves upon the word clustering algorithms currently used in speech and language processing applications. Our approach is to bootstrap a language model by simultaneous word and document clustering.
The basic idea is that word clustering improves document classes by reducing the dimension to obtain more robust statistical estimates, and document clustering improves word classes by introducing a topic dependence that leads to rich semantic classes. We will present results on the Switchboard corpus of conversational speech transcripts that demonstrate the strengths of our approach. We will also describe our current research on applying this methodology to event detection in the Topic Detection and Tracking (TDT) study, and on adapting the technique to dynamic clustering of named entities and events to enable interactive browsing of large digital video libraries. This is joint work with Peter Venable at CMU.
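The interplay between word classes and document classes described above can be illustrated with a minimal sketch. It is not the model presented in the talk: it assumes a plain document-by-word count matrix and substitutes a simple alternating hard clustering, with each item assigned to the nearest cluster centroid by KL divergence. The names `cocluster`, `_assign`, and `_normalize_rows`, and all parameters, are hypothetical and chosen only for illustration.

```python
import numpy as np

def _normalize_rows(m, eps=1e-9):
    # Smooth and normalize each row into a probability distribution.
    m = m + eps
    return m / m.sum(axis=1, keepdims=True)

def _assign(rows, n_clusters, labels, n_iter=10):
    # Hard-assign each row distribution to the centroid with the smallest
    # KL divergence, starting from the given labels.
    p = _normalize_rows(rows)
    for _ in range(n_iter):
        centroids = np.empty((n_clusters, p.shape[1]))
        for k in range(n_clusters):
            members = p[labels == k]
            # An empty cluster falls back to the global mean distribution.
            centroids[k] = members.mean(axis=0) if len(members) else p.mean(axis=0)
        # KL(p_i || c_k) = sum_j p_ij log p_ij - sum_j p_ij log c_kj
        kl = (p * np.log(p)).sum(axis=1, keepdims=True) - p @ np.log(centroids).T
        new_labels = kl.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels

def cocluster(counts, n_word_classes, n_doc_classes, n_rounds=5, seed=0):
    # counts: (n_docs, n_words) matrix of word counts (an assumed input format).
    rng = np.random.default_rng(seed)
    n_docs, n_words = counts.shape
    word_labels = rng.integers(n_word_classes, size=n_words)
    doc_labels = rng.integers(n_doc_classes, size=n_docs)
    for _ in range(n_rounds):
        # Word classes reduce the dimension of each document's representation.
        word_onehot = np.eye(n_word_classes)[word_labels]
        doc_labels = _assign(counts @ word_onehot, n_doc_classes, doc_labels)
        # Document classes give each word a topic-dependent profile.
        doc_onehot = np.eye(n_doc_classes)[doc_labels]
        word_labels = _assign(counts.T @ doc_onehot, n_word_classes, word_labels)
    return word_labels, doc_labels

# Toy usage: 4 documents over a 6-word vocabulary.
counts = np.array([[5, 4, 6, 0, 0, 1],
                   [4, 6, 5, 1, 0, 0],
                   [0, 1, 0, 5, 6, 4],
                   [1, 0, 0, 6, 4, 5]], dtype=float)
word_labels, doc_labels = cocluster(counts, n_word_classes=2, n_doc_classes=2)
```

In this sketch, each round clusters documents on their counts over word classes (the reduced dimension), then clusters words on their counts over document classes (the topic dependence). The result depends on the random initialization, so a practical version would likely use several restarts or soft assignments.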
 

Biographical sketch: John Lafferty obtained the Ph.D. degree in Mathematics from Princeton University in 1986, and was a member of the Program in Applied and Computational Mathematics at Princeton. After teaching briefly at Harvard University, he joined the IBM Thomas J. Watson Research Center in Yorktown Heights, NY, as a Research Staff Member, where he began working on statistical approaches to language processing. Since 1994 he has been a member of the faculty of the School of Computer Science at Carnegie Mellon University, where he is currently an Associate Professor in the Computer Science Department and the Language Technologies Institute, and an affiliated faculty member of the Center for Automated Learning and Discovery and the Program in Algorithms, Combinatorics and Optimization. His research interests include statistical learning algorithms, speech and natural language processing, and coding and information theory. Dr. Lafferty is a member of the Speech Technical Committee of the IEEE Signal Processing Society, and received an IBM University Partnership Award for faculty development in 1998.
 