WS 98 Research Projects
An NSF Workshop: Language Engineering for Students
and Professionals Integrating Research and Education
Three Core
Natural Language Processing Technology
Syntactic analysis is one of the crucial ingredients of natural language understanding. When we hear a sentence such as "I saw John," we identify "saw" as the main verb or event in the sentence, "I" as the subject doing the seeing, and "John" as the object being seen. While this example is simple, things become complex very quickly as the sentence to be understood grows longer. A common problem is ambiguity: in the sentence "I saw the man with a telescope," either the man who was seen had a telescope, or a telescope was used to see him. Even for moderately long sentences, tens or even hundreds of thousands of distinct analyses are possible. Yet automatic syntactic analysis based on statistical methods has been quite successful for English: state-of-the-art parsers correctly extract 90% of the dependencies from newspaper text such as the Wall Street Journal. This is done by annotating a large number of sentences by hand and building a statistical model that estimates how frequently a particular analysis is encountered. The model then ranks the various analyses of a new sentence by likelihood and efficiently computes the most likely one. Most state-of-the-art parsing models rely heavily on lexical information in choosing an analysis.
While much is known about parsing English text, parsing a highly inflective or free word order language such as Czech adds a new dimension of difficulty. The inflective nature of the language means that the vocabulary as seen by a computer appears huge, because each inflectional form is a distinct word. Unlike English, Czech does not obey a fixed subject-verb-object ordering of words as in "Peter sells cars." Subjects and objects are often identified by their inflectional forms, and discourse plays a role in syntactic analysis. Participants in this project plan to explore techniques of syntactic analysis, both known and new, that use inflectional information to deal with the free word order.
The techniques developed here for Czech newspaper text are expected to be useful for Polish, Russian, Serbo-Croatian and other Slavic languages, and for languages such as Spanish, German and Italian, which exhibit inflection and free word order to a lesser degree.
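As a rough illustration of the statistical approach described above (count how often each analysis appears in hand-annotated data, then rank the analyses of a new sentence by estimated likelihood), the following Python sketch ranks the two readings of "I saw the man with a telescope." The toy annotations, the add-one smoothing, and the names TOY_ANNOTATIONS, SITES, train and rank_analyses are all assumptions made for this illustration; they are not the workshop's model or data.

```python
from collections import Counter

# Hypothetical hand annotations (invented for illustration, not real treebank data).
# Each entry records a verb, a preposition, and whether the prepositional phrase
# was attached to the verb ("verb") or to the object noun ("noun").
TOY_ANNOTATIONS = [
    ("saw", "with", "verb"),
    ("saw", "with", "verb"),
    ("saw", "with", "noun"),
    ("met", "with", "noun"),
    ("ate", "with", "verb"),
]

# The two candidate analyses considered in this sketch.
SITES = ("verb", "noun")


def train(annotations):
    """Count how often each attachment site occurs for each (verb, preposition) pair."""
    counts = Counter()
    for verb, prep, site in annotations:
        counts[(verb, prep, site)] += 1
    return counts


def rank_analyses(counts, verb, prep):
    """Return the candidate analyses sorted from most to least likely.

    Probabilities are relative frequencies with add-one smoothing, so an
    unseen (verb, preposition) pair still gets a defined (uniform) score.
    """
    total = sum(counts[(verb, prep, site)] for site in SITES)
    scored = [
        (site, (counts[(verb, prep, site)] + 1) / (total + len(SITES)))
        for site in SITES
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


model = train(TOY_ANNOTATIONS)
ranking = rank_analyses(model, "saw", "with")
for site, prob in ranking:
    print(f"attach to {site}: {prob:.2f}")
```

With these toy counts the reading in which the telescope is used for seeing ranks first. A real parser would score complete analyses of whole sentences and search a far larger space, but the count-then-rank idea is the same.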

The Center for Language and Speech Processing

Johns Hopkins University

3400 N. Charles Street, Barton Hall, Baltimore, MD 21218
Telephone: 410 516 4237 Fax: 410 516 5050 E-mail: clsp@jhu.edu