Integrating history-length interpolation and classes in language modeling
Hinrich Schuetze, University of Stuttgart
April 19, 2011
Building on earlier work that integrates different factors in language modeling, we view (i) backing off to a shorter history and (ii) class-based generalization as two complementary mechanisms of using a larger equivalence class for prediction when the default equivalence class is too small for reliable estimation. This view entails that the classes in a language model should be learned from rare events only and should preferably be applied to rare events. We construct such a model and show that both training on rare events and preferential application to rare events improve perplexity when compared to a simple direct interpolation of a class-based model with a standard language model.
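To make the baseline concrete: the "simple direct interpolation" the abstract compares against mixes a class-based bigram with a word bigram. The sketch below is a toy illustration of that idea, not the paper's actual model; the corpus, the word-to-class mapping, and the interpolation weight lam are all invented for the example. It shows how the class model assigns probability mass to a word bigram never seen in training ("dog sat") by generalizing from a seen bigram in the same class ("cat sat").

```python
# Toy sketch of interpolating a class-based bigram LM with a word bigram LM.
# All data, classes, and the weight `lam` are hypothetical illustrations.
from collections import Counter

corpus = "the cat sat on the mat the dog ran to the rug".split()
word2class = {"cat": "ANIMAL", "dog": "ANIMAL", "mat": "OBJECT", "rug": "OBJECT"}

def cls(w):
    # Unmapped words act as singleton classes of their own.
    return word2class.get(w, w)

# Maximum-likelihood counts for the word model and the class model.
bigrams   = Counter(zip(corpus, corpus[1:]))
unigrams  = Counter(corpus[:-1])
cbigrams  = Counter((cls(a), cls(b)) for a, b in zip(corpus, corpus[1:]))
cunigrams = Counter(cls(w) for w in corpus[:-1])
cmembers  = Counter(cls(w) for w in corpus)
wcount    = Counter(corpus)

def p_word(w, h):
    """ML word bigram estimate P(w | h)."""
    return bigrams[(h, w)] / unigrams[h] if unigrams[h] else 0.0

def p_class(w, h):
    """Class bigram estimate P(c(w) | c(h)) * P(w | c(w))."""
    if not cunigrams[cls(h)]:
        return 0.0
    p_trans = cbigrams[(cls(h), cls(w))] / cunigrams[cls(h)]
    p_emit = wcount[w] / cmembers[cls(w)]
    return p_trans * p_emit

def p_interp(w, h, lam=0.3):
    """Direct linear interpolation of the two models."""
    return lam * p_class(w, h) + (1 - lam) * p_word(w, h)
```

For example, the word bigram "dog sat" never occurs in the toy corpus, so `p_word("sat", "dog")` is 0; but because "cat sat" was seen and cat and dog share the class ANIMAL, `p_class("sat", "dog")` is nonzero, and the interpolated model assigns the event positive probability. The paper's contribution is to go beyond this fixed mixture by learning the classes from rare events and weighting the class component more heavily for rare histories.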
Hinrich Schuetze is a professor of computational linguistics in the School of Computer Science and Electrical Engineering at the University of Stuttgart in Germany. He received his PhD in linguistics from Stanford University in 1995 and worked in the areas of text mining and information retrieval at a number of research institutions and startups in Silicon Valley until 2004. His research focuses on natural language processing problems that are important for applications like information retrieval and machine translation and at the same time contribute to our fundamental understanding of language as a cognitive phenomenon. He is a coauthor of Foundations of Statistical Natural Language Processing (MIT Press, with Chris Manning) and Introduction to Information Retrieval (Cambridge University Press, with Chris Manning and Prabhakar Raghavan).