Core Natural Language Processing Technology Applicable to Multiple Languages

Syntactic analysis is one of the crucial ingredients of natural language understanding. When we hear a sentence such as “I saw John,” we identify saw as the main verb or event in the sentence, I as the subject doing the seeing, and John as the object being seen. While this example is simple, things become complex very quickly as the sentence to be understood grows longer. A common problem is ambiguity: in the sentence “I saw the man with a telescope,” either the man who was seen had a telescope, or a telescope was used to see the man. Even for moderately long sentences, tens or even hundreds of thousands of distinct analyses are possible.
Yet automatic syntactic analysis based on statistical methods has been quite successful for English: state-of-the-art parsers correctly extract 90% of the dependencies from newspaper text such as the Wall Street Journal. This is done by annotating a large number of sentences by hand and building a statistical model that estimates how frequently a particular analysis is encountered. The model then ranks the various analyses of a new sentence by likelihood and efficiently computes the most likely one. Most state-of-the-art parsing models rely heavily on lexical information in choosing an analysis.
While much is known about parsing English text, parsing a highly inflective, free word order language such as Czech adds a new dimension of difficulty. The inflective nature means that the vocabulary, as seen by a computer, appears huge, because each inflectional form is a distinct word. Unlike English, Czech does not obey a fixed subject-verb-object ordering of words as in “Peter sells cars.” Subjects and objects are often identified by their inflectional forms, and discourse plays a role in syntactic analysis. Participants in this project plan to explore techniques of syntactic analysis, both known and new, which utilize inflectional information to deal with the free word order.
The techniques developed here for Czech newspaper text are expected to be useful for Polish, Russian, Serbo-Croatian and other Slavic languages, and for languages such as Spanish, German and Italian, which exhibit inflection and free word order to a lesser degree.
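The ranking step described above, in which a model estimated from hand-annotated sentences scores competing analyses of a new sentence, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the three-sentence toy "treebank", the representation of an analysis as (head, dependent) word pairs, the add-one smoothing, and the function names. It is not the parsing model used in this project, and real parsers search the space of analyses efficiently rather than enumerating candidates.

```python
from collections import Counter
from math import log

# Hypothetical hand-annotated dependencies as (head, dependent) pairs.
# Toy data invented for this sketch, not drawn from any real treebank.
TREEBANK = [
    ("saw", "I"), ("saw", "man"), ("saw", "with"), ("with", "telescope"),
    ("saw", "I"), ("saw", "star"), ("saw", "with"), ("with", "binoculars"),
    ("met", "I"), ("met", "man"), ("man", "with"), ("with", "hat"),
]


def train(dependencies):
    """Count how often each dependent attaches to each head."""
    pair_counts = Counter(dependencies)
    dep_counts = Counter(dep for _, dep in dependencies)
    vocab_size = len({word for pair in dependencies for word in pair})
    return pair_counts, dep_counts, vocab_size


def log_score(analysis, model):
    """Sum of log P(head | dependent), estimated with add-one smoothing."""
    pair_counts, dep_counts, vocab_size = model
    total = 0.0
    for head, dep in analysis:
        p = (pair_counts[(head, dep)] + 1) / (dep_counts[dep] + vocab_size)
        total += log(p)
    return total


def best_analysis(candidates, model):
    """Return the name of the highest-scoring candidate analysis."""
    return max(candidates, key=lambda name: log_score(candidates[name], model))


# Two analyses of "I saw the man with a telescope" (articles omitted).
CANDIDATES = {
    # "with" attaches to the verb: the telescope was used for seeing.
    "instrument": [("saw", "I"), ("saw", "man"),
                   ("saw", "with"), ("with", "telescope")],
    # "with" attaches to the noun: the man had the telescope.
    "modifier": [("saw", "I"), ("saw", "man"),
                 ("man", "with"), ("with", "telescope")],
}

MODEL = train(TREEBANK)
print(best_analysis(CANDIDATES, MODEL))  # prints "instrument" on this toy data
```

The outcome reflects only the toy counts, where "with" attaches to "saw" twice and to "man" once. Because the score is built from word-level counts, the sketch also hints at why a highly inflective language is harder: every inflectional form is a separate entry, so each individual count becomes sparser.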

 

Team Members
Senior Members
Eric Brill (CLSP/JHU)
Jan Hajic (Charles Univ., CR)
Doug Jones (DoD)
Lance Ramshaw (BBN)
Graduate Students
Michael Collins (UPenn)
Barbora Hladka (Charles Univ., CR)
Christoph Tillmann (Lehrstuhl, Aachen)
Daniel Zeman (Charles Univ., CR)
Undergraduate Students
Cynthia Kuo (Stanford)
Oren Schwartz (UPenn)

Center for Language and Speech Processing