Scalable Topic Models and Applications to Machine Translation – Ke Zhai (University of Maryland, College Park)

When:
April 8, 2014 all-day
Where:
3400 N Charles St
Baltimore, MD 21218
USA

Abstract
Topic models are powerful tools for statistical analysis in text processing. Despite their success, their application to large datasets is hampered by the difficulty of scaling inference to large parameter spaces. In this talk, we describe two ways to speed up topic models: parallelization and streaming. We propose a scalable and flexible implementation using variational inference on MapReduce. We further demonstrate two extensions of this model: using informed priors to incorporate word correlations, and extracting topics from a multilingual corpus. An alternative approach to achieving scalability is streaming, where the algorithm sees a small part of the data at a time and updates the model gradually. Although many streaming algorithms have been proposed for topic models, they all overlook a fundamental but challenging problem: the vocabulary is constantly evolving. We propose an online topic model with infinite vocabulary, which addresses this missing piece, and show that our algorithm is able to discover new words and refine topics on the fly. We also examine how topic models help in acquiring domain knowledge and improving machine translation.

Biography

Ke Zhai is a PhD candidate in the Department of Computer Science, University of Maryland, College Park, working with Prof. Jordan Boyd-Graber. He is expected to receive his PhD degree in Fall 2014. He works in the area of Machine Learning and Natural Language Processing, with an additional focus on scalability and cloud computing. He has also worked on several research projects applying probabilistic Bayesian models to image processing and dialogue modelling. He has open-sourced several libraries, including Mr. LDA, a package for large-scale topic modeling that has been adopted in research and industry.

Center for Language and Speech Processing