Scalable Topic Models – David Blei (Princeton University)

January 31, 2012 all-day

Probabilistic topic modeling provides a suite of tools for analyzing large collections of documents. Topic modeling algorithms can uncover the underlying themes of a collection and decompose its documents according to those themes. We can use topic models to explore the thematic structure of a corpus and to solve a variety of prediction problems about documents.

At the center of a topic model is a hierarchical mixed-membership model, in which each document exhibits a shared set of mixture components with individual (per-document) proportions. Our goal is to condition on the observed words of a collection and estimate the posterior distribution of the shared components and the per-document proportions. When analyzing modern corpora, this amounts to posterior inference with billions of latent variables.

How can we cope with such data? In this talk, I will describe stochastic variational inference, an algorithm for computing with topic models that can handle very large document collections and even endless streams of documents. I will demonstrate the algorithm with models fitted to millions of articles, show how stochastic variational inference can be generalized to many kinds of hierarchical models, and highlight several open questions and outstanding issues.

(This is joint work with Francis Bach, Matt Hoffman, John Paisley, and Chong Wang.)
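The general recipe behind stochastic variational inference for LDA — repeatedly sample one document, fit its local variational parameters, then take a decreasing-step-size move on the global topic parameters — can be sketched as follows. This is a minimal illustrative NumPy sketch, not the code from the talk; the function names, hyperparameter defaults, and the series approximation to the digamma function are all assumptions made here for self-containment.

```python
import numpy as np

def digamma(x):
    # Illustrative digamma approximation (recurrence + asymptotic series),
    # used so the sketch needs only NumPy.
    x = np.asarray(x, dtype=float).copy()
    r = np.zeros_like(x)
    while np.any(x < 6):
        mask = x < 6
        r[mask] -= 1.0 / x[mask]
        x = np.where(mask, x + 1.0, x)
    f = 1.0 / (x * x)
    return r + np.log(x) - 0.5 / x - f * (1/12. - f * (1/120. - f/252.))

def expect_log_dirichlet(a):
    # E[log p] under Dirichlet(a), taken row-wise.
    return digamma(a) - digamma(a.sum(axis=-1, keepdims=True))

def svi_lda(docs, K, V, alpha=0.1, eta=0.01, tau=1.0, kappa=0.7,
            n_iters=200, inner=20, seed=0):
    """Stochastic variational inference for LDA (one document per step).
    docs: list of length-V word-count vectors.
    Returns lam (K x V), the variational Dirichlet over each topic."""
    rng = np.random.default_rng(seed)
    D = len(docs)
    lam = rng.gamma(100.0, 0.01, size=(K, V))   # global topic parameters
    for t in range(n_iters):
        n = docs[rng.integers(D)]               # sample one document
        Elog_beta = expect_log_dirichlet(lam)   # (K, V)
        gamma = np.ones(K)                      # local Dirichlet over topics
        for _ in range(inner):                  # local fixed-point updates
            Elog_theta = expect_log_dirichlet(gamma)
            log_phi = Elog_theta[:, None] + Elog_beta
            log_phi -= log_phi.max(axis=0)      # stabilize before exp
            phi = np.exp(log_phi)
            phi /= phi.sum(axis=0)              # word-topic responsibilities
            gamma = alpha + phi @ n
        rho = (t + tau) ** (-kappa)             # decreasing step size
        lam_hat = eta + D * phi * n             # noisy one-document estimate
        lam = (1 - rho) * lam + rho * lam_hat   # stochastic global update
    return lam
```

Because each step touches only one document, the per-iteration cost is independent of corpus size, which is what lets the same update handle millions of articles or an endless document stream.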
David Blei is an associate professor of Computer Science at Princeton University. His research interests include probabilistic topic models, graphical models, approximate posterior inference, and Bayesian nonparametrics.

Center for Language and Speech Processing