Clustering algorithms are important for a wide variety of tasks in information
retrieval, natural language and speech processing. We will present
a new language modeling approach to unsupervised clustering of words, constituents,
and documents. Our probabilistic method offers advantages over more
algebraic techniques such as latent semantic analysis, and improves upon
the word clustering algorithms currently used in speech and language processing
applications. Our approach is to bootstrap a language model by simultaneous
word and document clustering.
The basic idea is that word clustering improves document classes by
reducing the dimension to obtain more robust statistical estimates, and
document clustering improves word classes by introducing a topic dependence
that leads to rich semantic classes. We will present results on the
Switchboard corpus of conversational speech transcripts that demonstrate
the strengths of our approach. We will also describe our current
research on applying this methodology to event detection in the Topic
Detection and Tracking (TDT) study,
and on adapting the technique to dynamic clustering of named entities and
events to enable interactive browsing of large digital video libraries.
This is joint work with Peter Venable at CMU.
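The bootstrapping idea described above can be illustrated with a toy sketch.
This is not the probabilistic language-modeling method of the talk: it
substitutes plain hard k-means for the model, and every name in it (co_cluster,
kmeans, profile, corpus) is hypothetical. It only shows the alternation:
documents are described by counts over word clusters (a reduced dimension),
and words are then described by counts over the resulting document clusters.

```python
from collections import Counter

def dist(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def profile(counts, k):
    """Turn a Counter over cluster ids 0..k-1 into a normalized vector."""
    total = sum(counts.values()) or 1
    return [counts.get(c, 0) / total for c in range(k)]

def kmeans(profiles, k, rounds=10):
    """Hard k-means with deterministic farthest-first seeding (an assumption
    made for this sketch, standing in for a probabilistic clustering step)."""
    keys = sorted(profiles)
    centers = [profiles[keys[0]]]
    while len(centers) < k:
        centers.append(max((profiles[x] for x in keys),
                           key=lambda v: min(dist(v, c) for c in centers)))
    labels = {}
    for _ in range(rounds):
        for x in keys:
            labels[x] = min(range(k), key=lambda c: dist(profiles[x], centers[c]))
        for c in range(k):
            members = [profiles[x] for x in keys if labels[x] == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def co_cluster(corpus, n_word_clusters, n_doc_clusters, sweeps=3):
    """Alternate between clustering documents and clustering words."""
    vocab = sorted({w for words in corpus.values() for w in words})
    # Arbitrary starting word classes: round-robin over the sorted vocabulary.
    word_labels = {w: i % n_word_clusters for i, w in enumerate(vocab)}
    doc_labels = {}
    for _ in range(sweeps):
        # Documents seen through word clusters: a low-dimensional profile.
        doc_profiles = {
            d: profile(Counter(word_labels[w] for w in words), n_word_clusters)
            for d, words in corpus.items()
        }
        doc_labels = kmeans(doc_profiles, n_doc_clusters)
        # Words seen through document clusters: a topic-dependent profile.
        word_counts = {w: Counter() for w in vocab}
        for d, words in corpus.items():
            for w in words:
                word_counts[w][doc_labels[d]] += 1
        word_profiles = {w: profile(word_counts[w], n_doc_clusters) for w in vocab}
        word_labels = kmeans(word_profiles, n_word_clusters)
    return word_labels, doc_labels

# Made-up four-document corpus with two obvious topics.
corpus = {
    "d1": "game team score team".split(),
    "d2": "team score game win".split(),
    "d3": "stock market price stock".split(),
    "d4": "market price stock trade".split(),
}
word_labels, doc_labels = co_cluster(corpus, n_word_clusters=2, n_doc_clusters=2)
```

On this toy corpus the two sports documents and the two finance documents end
up in separate document clusters, and the vocabulary splits along the same
line. Real data would need soft (probabilistic) assignments and a more careful
initialization than the round-robin start assumed here.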

Biographical sketch: John Lafferty obtained the Ph.D. degree
in Mathematics from Princeton University in 1986, and was a member of the
Program in Applied and Computational Mathematics at Princeton. After
teaching briefly at Harvard University, he joined the IBM Thomas J. Watson
Research Center in Yorktown Heights, NY, as a Research Staff Member, where
he began working on statistical approaches to language processing.
Since 1994 he has been a member of the faculty of the School of Computer
Science at Carnegie Mellon University, where he is currently an Associate
Professor in the Computer Science Department and the Language Technologies
Institute, and an affiliated faculty member of the Center for Automated
Learning and Discovery and the Program in Algorithms, Combinatorics and
Optimization. His research interests include statistical learning
algorithms, speech and natural language processing, and coding and information
theory. Dr. Lafferty is a member of the Speech Technical Committee
of the IEEE Signal Processing Society, and received an IBM University Partnership
Award for faculty development in 1998.