Statistical Topic Models for Computational Social Science

Hanna Wallach, University of Massachusetts Amherst

April 5, 2011


Abstract

To draw data-driven conclusions, social scientists need quantitative tools for analyzing massive, complex collections of textual information. I will discuss the development of such tools, concentrating on a class of models known as statistical topic models, which automatically infer groups of semantically related words (topics) from word co-occurrence patterns in documents, without requiring human intervention. The resultant topics can be used to answer a diverse range of research questions, including detecting and characterizing emergent behaviors, identifying topic-based communities, and tracking trends across languages. The foundation of statistical topic modeling is Bayesian statistics, which requires that assumptions, or prior beliefs, be made explicit. Until recently, most statistical topic models relied on two unchallenged prior beliefs. In this talk, I will explain how challenging these beliefs increases robustness to the skewed word frequency distributions common in text. I will also talk about recent work (with Rachel Shorey and Bruce Desmarais) on statistical topic models for studying temporal and textual patterns in formerly classified government documents.

Biography

Hanna Wallach is an assistant professor in the Department of Computer Science at the University of Massachusetts Amherst. She is one of five core faculty members involved in UMass's newly formed computational social science research initiative. Previously, Hanna was a postdoctoral researcher, also at UMass, where she developed Bayesian latent variable models for analyzing complex data regarding communication and collaboration within scientific and technological communities. Her recent work (with Ryan Adams and Zoubin Ghahramani) on infinite belief networks won the best paper award at AISTATS 2010. Hanna has co-organized multiple workshops on both computational social science and Bayesian latent variable modeling. Her tutorial on conditional random fields is widely referenced and used in machine learning courses around the world. In addition to her research, Hanna works to promote and support women's involvement in computing. In 2006, she co-founded the annual workshop for women in machine learning to give female faculty, research scientists, postdoctoral researchers, and graduate students an opportunity to meet, exchange research ideas, and build mentoring and networking relationships. In her not-so-spare time, Hanna is a member of Pioneer Valley Roller Derby, where she is better known as Logistic Aggression.