The Prague Dependency Treebank – Jan Hajic (Charles University)

March 25, 1997

Following the Prague tradition, a dependency-based description of the formal representation of language structures has become the basis for building an annotated Czech corpus (approximately 1 million words). The resulting corpus will also become part of the Czech National Corpus, which currently holds about 35 million words and will reach 100 million words by the end of 1998.

During the talk, all three levels of annotation of the Prague Dependency Treebank (PDT) will be briefly explained: morphological, analytical, and tectogrammatical. The talk will then concentrate on the analytical level, which is currently being worked on. The principles of annotation will be presented in detail, together with some interesting phenomena and the rules for their representation and annotation, such as multiword expressions, non-continuous sentence constituents, incomplete sentences, ellipsis, coordination, and parenthesis.

The software used for automatic preprocessing of the input text, as well as the hand-annotation support software, will be described and demonstrated.

Center for Language and Speech Processing