Learning Constraint-Based Grammars from Representative Data – Smaranda Muresan (UMD)
View Seminar Video
Abstract
Traditional natural language processing systems have focused on modeling the deep, human-like level of text understanding, by integrating syntax and semantics. However, they overlooked a key requirement for scalability: learning. Modern natural language systems on the other hand, have embraced learning methods to ensure scalability, but they remain at a shallow level of text understanding by their inability to successfully model semantics. In this talk I will present a computationally efficient model for deep language understanding that brings together syntax, semantics and learning. I will present a new grammar formalism, Lexicalized Well-Founded Grammar, which integrates syntax and semantics, and is learnable from a small set of representative annotated examples, defining the importance to the model linguistically, and not simply by frequency, as in most previous work. The grammar rules have compositional and ontology constraints that provide access to meaning during parsing. The semantic representation is an ontology query language which allows a deep-level text-to-knowledge acquisition. I have proven that under appropriate assumptions the search space for grammar learning is a complete grammar lattice, which guarantees the uniqueness of the solution. I will show the linguistic relevance of a practical LWFG learning framework and its utility for populating terminological knowledge bases from text in the medical domain.
Biography
Smaranda Muresan received her PhD degree in Computer Science from Columbia University. She is currently a Postdoctoral Research Associate at the Institute for Advanced Computer Studies at University of Maryland. Her research interests include language learning and understanding, machine translation and relational learning. Her work unifies two separate but central themes in human language technologies: computational formalisms to express language phenomena and induction of knowledge from data.