Unsupervised Learning of Natural Language Structure – Dan Klein (Berkeley)

November 2, 2004 all-day

There is precisely one complete language processing system to date: the human brain. Though there is debate on how much built-in bias human learners might have, we definitely acquire language in a primarily unsupervised fashion. On the other hand, computational approaches to language processing are almost exclusively supervised, relying on hand-labeled corpora for training. This reliance is largely due to repeated failures of unsupervised approaches. In particular, the problem of learning syntax (grammar) from completely unannotated text has received a great deal of attention for well over a decade, with little in the way of positive results. We argue that previous methods for this task have generally failed because of the representations they used. Overly complex models are easily distracted by non-syntactic correlations (such as topical associations), while overly simple models aren’t rich enough to capture important first-order properties of language (such as directionality, adjacency, and valence). We describe several syntactic representations which are designed to capture the basic character of natural language syntax as directly as possible. With these representations, high-quality parses can be learned from surprisingly little text, with no labeled examples and no language-specific biases. Our results are the first to show above-baseline performance in unsupervised parsing, and far exceed the baseline (in multiple languages). These specific grammar learning methods are useful since parsed corpora exist for only a small number of languages. More generally, most high-level NLP tasks, such as machine translation and question-answering, lack richly annotated corpora, making unsupervised methods extremely appealing, even for common languages like English.
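The abstract refers to "above-baseline performance in unsupervised parsing." As a toy illustration only (not the speaker's method), unsupervised parsers are commonly scored by unlabeled bracketing precision/recall/F1 against treebank constituents, with the right-branching tree as a standard baseline; the sentence spans and gold brackets below are invented for the sketch:

```python
# Toy sketch of unlabeled bracketing evaluation; all data is hypothetical.

def right_branching_brackets(n):
    """Constituent spans (i, j) of the right-branching binary tree over n words."""
    return {(i, n) for i in range(n - 1)}  # (0,n), (1,n), ..., (n-2,n)

def bracket_f1(predicted, gold):
    """Unlabeled precision, recall, and F1 over constituent spans."""
    tp = len(predicted & gold)
    p = tp / len(predicted) if predicted else 0.0
    r = tp / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

# A hypothetical 5-word sentence whose gold tree has these brackets:
gold = {(0, 5), (0, 2), (2, 5), (3, 5)}
pred = right_branching_brackets(5)     # {(0,5), (1,5), (2,5), (3,5)}
print(bracket_f1(pred, gold))          # precision, recall, F1
```

Beating this baseline is nontrivial because English is heavily right-branching, so the baseline already matches many gold constituents.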

Dan Klein is an assistant professor of computer science at UC Berkeley, having recently completed his doctoral work at Stanford University. He holds a BA from Cornell University (summa cum laude in computer science, linguistics, and math) and a master's degree in linguistics from Oxford University. Professor Klein's research focuses on natural language processing, including unsupervised grammar induction, statistical parsing methods, and information extraction. His academic honors include a British Marshall Fellowship, several graduate research fellowships, and best paper awards at the ACL and EMNLP conferences.

Center for Language and Speech Processing