A Brief History of the Penn Treebank – Mitch Marcus (University of Pennsylvania)
The Penn Treebank, initially released in 1992, was the first richly annotated text corpus widely available within the natural language processing (NLP) community. Its release led within a few years to the development of the first competent English parsers and helped spark the statistical revolution within NLP. The Penn Treebank has since become the de facto standard for training and testing English parsers, and it still plays this role nearly two decades after its release. This talk will briefly describe the Penn Treebank and its applications, then discuss the history of the Treebank’s development, from Fred Jelinek’s first proposal of a treebank to DARPA in 1987 through our development of the Treebank from 1989 until the release of Treebank II in 1995. I will attempt to explain the Penn Treebank’s motivations and the process of creating it, perhaps explaining why it has some of its more peculiar properties. This talk describes joint work with Beatrice Santorini, Mary Ann Marcinkiewicz, Grace Kim, Ann Bies, and many others.
Mitchell Marcus is the RCA Professor of Artificial Intelligence in the Department of Computer and Information Science at the University of Pennsylvania. He was the principal investigator for the Penn Treebank Project through the mid-1990s; he and his collaborators continue to develop hand-annotated corpora for use worldwide as training materials for statistical natural language systems. His other research interests include statistical natural language processing, human-robot communication, and cognitively plausible models for automatic acquisition of linguistic structure. He has served as chair of Penn’s Computer and Information Science Department, as chair of the Penn Faculty Senate, and as president of the Association for Computational Linguistics. He is also a Fellow of the American Association for Artificial Intelligence. He currently serves as chair of the Advisory Committee of the Center of Excellence in Human Language Technology at JHU and as a member of the advisory committee for the Department of Computer and Information Science.