Sparse Models of Lexical Variation – Jacob Eisenstein (Carnegie Mellon University)
Text analysis involves building predictive models and discovering latent structures in noisy and high-dimensional data. Document classes, latent topics, and author communities are often distinguished by a small number of trigger words or phrases — needles in a haystack of irrelevant features. In this talk, I describe generative and discriminative techniques for learning sparse models of lexical differences. First, I show how multi-task regression with structured sparsity can identify a small subset of words associated with a range of demographic attributes in social media, yielding new insights about the complex multivariate relationship between demographics and lexical choice. Second, I present SAGE, a novel approach to sparsity in generative models of text, in which we induce sparse deviations from background log probabilities. As a generative model, SAGE can be applied across a range of supervised and unsupervised applications, including classification, topic modeling, and latent variable models.
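To make the idea of sparse deviations from background log probabilities concrete, the following is a minimal sketch, not the speaker's implementation. It assumes that a word's probability is proportional to the exponential of its background log probability plus a deviation term, and that most deviations are exactly zero. The vocabulary, background frequencies, deviation values, and the function name `sage_distribution` are all hypothetical and chosen only for illustration. How the sparse deviations would be learned is not shown.

```python
import math


def sage_distribution(background_log_probs, deviations):
    """Return p(w) proportional to exp(m_w + eta_w) over the vocabulary."""
    scores = [m + eta for m, eta in zip(background_log_probs, deviations)]
    max_score = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical toy vocabulary and background frequencies (illustrative only).
vocab = ["the", "of", "game", "vote"]
background = [0.5, 0.3, 0.1, 0.1]
m = [math.log(p) for p in background]

# Sparse deviation vector: only "game" departs from the background.
eta = [0.0, 0.0, 1.5, 0.0]

p = sage_distribution(m, eta)
for word, base, prob in zip(vocab, background, p):
    print(f"{word:>5}  background={base:.3f}  class={prob:.3f}")
```

In this sketch a class or topic is described by the few nonzero entries of `eta` rather than by a full word distribution. Words with zero deviation keep their background ratios to one another, which is why a zero vector recovers the background distribution exactly.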
Jacob Eisenstein is a postdoctoral fellow in the Machine Learning Department at Carnegie Mellon University. His research focuses on machine learning for social media analysis, discourse, and non-verbal communication. Jacob completed his Ph.D. at MIT in 2008, winning the George M. Sprowls dissertation award. In January 2012, Jacob will join Georgia Tech as an Assistant Professor in the School of Interactive Computing.