Alpino: Wide-coverage Computational Analysis of Dutch – Gertjan van Noord (University of Groningen)

April 9, 2002

Alpino is a wide-coverage computational analyzer of Dutch that aims at accurate, full parsing of unrestricted text. It is based on a constructionalist HPSG grammar with a large lexical component, and it produces dependency structures as proposed in the CGN (Corpus of Spoken Dutch). Important aspects of wide-coverage parsing are robustness, efficiency and disambiguation. In the talk we briefly introduce the Alpino system and then discuss two recent developments. The first development is the integration of a log-linear model for disambiguation. We show that this model performs well on the task despite the small amount of data used to train it. We also describe how we avoid the inherent efficiency problems of using such a log-linear model in parse selection. The second development concerns the implementation of an unsupervised POS-tagger. We show that a simple POS-tagger can be used to filter the results of lexical analysis of a wide-coverage computational grammar. Reducing the number of lexical categories not only greatly improves parsing efficiency but, in our experiments, also gave rise to a mild increase in parsing accuracy, in contrast to results reported in earlier work on supervised tagging. The novel aspect of our approach is that the POS-tagger does not require any human-annotated data; instead, it is trained on the parser output obtained on a large training set.
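
For readers unfamiliar with log-linear parse selection, the general idea can be sketched as follows. This is a minimal illustration and not Alpino's implementation: the feature names, the weights, and the helper functions `score`, `parse_probabilities` and `best_parse` are all hypothetical. Each candidate parse is represented as a bag of feature counts, its score is the weighted sum of those counts, and the parse with the highest score is selected.

```python
import math


def score(features, weights):
    """Unnormalised log-linear score: sum of weight * feature count."""
    return sum(weights.get(name, 0.0) * count for name, count in features.items())


def parse_probabilities(candidates, weights):
    """Conditional probability of each candidate parse (softmax over scores)."""
    scores = [score(features, weights) for features in candidates]
    highest = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def best_parse(candidates, weights):
    """Index of the highest-scoring parse; normalisation is not needed for this."""
    return max(range(len(candidates)), key=lambda i: score(candidates[i], weights))


# Hypothetical weights and feature counts, invented for illustration only.
weights = {"subj_before_verb": 1.2, "pp_attach_noun": -0.4, "pp_attach_verb": 0.3}
candidates = [
    {"subj_before_verb": 1, "pp_attach_noun": 1},
    {"subj_before_verb": 1, "pp_attach_verb": 1},
]

chosen = best_parse(candidates, weights)
probabilities = parse_probabilities(candidates, weights)
```

Selecting the best parse only requires comparing unnormalised scores. The sketch enumerates all candidates explicitly, which is exactly what becomes expensive when a sentence has very many parses; how the talk addresses that efficiency problem is not shown here.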

Center for Language and Speech Processing