Improving the Accuracy, Efficiency and Data Use for Natural Language Parsing – Shay Cohen (Columbia University)

March 8, 2013 all-day

We are facing enormous growth in the amount of information available from various data sources. This growth is even more notable when it comes to text data; the number of pages on the internet, for example, is expected to double every five years, with billions of multilingual webpages already available.
In order to make use of this textual data in natural language understanding systems, we need to rely on text analysis that structures this information. Natural language parsing, a fundamental problem in NLP, is one such example: it provides basic structure to text by representing its syntax computationally. This structure is used in most NLP applications that analyze language to understand meaning.
I will discuss three important facets of modeling syntax: (a) accuracy of learning; (b) efficiency of parsing unseen sentences; and (c) selection of data to learn from. The common theme linking these three ideas is learning from incomplete data. To model syntax more effectively, I will first describe a model called latent-variable probabilistic context-free grammars (L-PCFGs), which, because of the hardness of learning from incomplete data, has until recently been learned in tandem with many heuristics and approximations. I will show a much more principled and statistically consistent approach to learning L-PCFGs using spectral algorithms, and will also show how L-PCFGs can parse unseen sentences much more efficiently through the use of tensor decomposition.
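To make the L-PCFG idea concrete, here is a minimal illustrative sketch (not from the talk): each nonterminal carries a hidden latent state, a binary rule such as S -> NP VP becomes a probability tensor over the parent, left-child, and right-child states, and the probability of a fixed tree is computed by marginalizing the latent states bottom-up. The toy grammar, the number of states `m`, and the rule names are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 2  # hypothetical number of latent states per nonterminal

# Hypothetical toy grammar: S -> NP VP, NP -> "dogs", VP -> "bark".
# T_S[a, b, c] = P(S_a -> NP_b VP_c); rows over (b, c) sum to 1 per parent state a.
T_S = rng.random((m, m, m))
T_S /= T_S.sum(axis=(1, 2), keepdims=True)

q_NP = rng.random(m)        # q_NP[b] = P(NP_b -> "dogs"), one value per latent state
q_VP = rng.random(m)        # q_VP[c] = P(VP_c -> "bark")
pi = np.full(m, 1.0 / m)    # distribution over the root's latent state

# Inside score at the root: marginalize the children's latent states.
# inside[a] = sum_{b,c} T_S[a, b, c] * q_NP[b] * q_VP[c]
inside = np.einsum('abc,b,c->a', T_S, q_NP, q_VP)
p_tree = float(pi @ inside)  # finally marginalize the root state
print(p_tree)
```

The latent states never appear in the observed treebank, which is exactly why learning them is an incomplete-data problem; the tensor form of the binary rules is also what makes decomposition-based speedups natural.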
In addition, I will touch on work on unsupervised language learning, one of the holy grails of NLP, in the Bayesian setting. In this setting, priors are used to guide the learner, compensating for the lack of labeled data. I will survey novel priors that were developed for this setting, and mention how they can be used monolingually and multilingually.
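One simple way a prior can guide an unsupervised learner, sketched here as an illustration (the specific priors surveyed in the talk are not shown): a sparse Dirichlet prior over a nonterminal's rule probabilities prefers peaked distributions, encoding the linguistic expectation that each nonterminal rewrites in only a few ways. The number of rules `K` and the concentration values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(42)
K = 5  # hypothetical number of rewrite rules for one nonterminal

def entropy(p):
    # Shannon entropy of a probability vector, ignoring zero entries.
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

# A sparse prior (concentration < 1) favors peaked rule distributions,
# while a flat prior (concentration >> 1) favors near-uniform ones.
sparse = [entropy(rng.dirichlet([0.1] * K)) for _ in range(200)]
flat = [entropy(rng.dirichlet([10.0] * K)) for _ in range(200)]
print(np.mean(sparse), np.mean(flat))
```

With no labeled trees to constrain the grammar, this kind of preference is what steers the learner toward linguistically plausible solutions.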
Shay Cohen is a postdoctoral research scientist in the Department of Computer Science at Columbia University. He holds a CRA Computing Innovation Fellowship. He received his B.Sc. and M.Sc. from Tel Aviv University in 2000 and 2004, respectively, and his Ph.D. from Carnegie Mellon University in 2011. His research interests span a range of topics in natural language processing and machine learning, with a focus on structured prediction. He is especially interested in developing efficient and scalable parsing algorithms as well as learning algorithms for probabilistic grammars.

Center for Language and Speech Processing