Syntactic analysis is one of the crucial ingredients of
natural language understanding. When we hear a sentence such as ``I saw
John,'' we identify ``saw'' as the main verb or event in the sentence, ``I'' as
the subject doing the seeing, and ``John'' as the object being seen. While this
example is simple, things become complex very quickly as the sentence to
be understood grows longer. A common problem that arises is ambiguity:
in the sentence ``I saw the man with a telescope,'' either the man who was
seen had a telescope or a telescope was used to see him. Even for
moderately long sentences, tens or even hundreds of thousands of distinct
analyses are possible. Yet automatic syntactic analysis based on
statistical methods has been quite successful for English: state-of-the-art
parsers correctly extract 90% of the dependencies from newspaper text
such as the Wall Street Journal. This is done by annotating a large number
of sentences by hand and building a statistical model which estimates how
frequently a particular analysis is encountered. This model then ranks the
various analyses of a new sentence by likelihood, and efficiently computes
the most likely one. Most state-of-the-art parsing models rely heavily on
lexical information in choosing an analysis. While much is known about
parsing English text, it is easy to see that parsing a highly inflected
language with free word order, such as Czech, adds a new dimension
of difficulty. Inflection means that the vocabulary, as seen by a
computer, appears huge, because each inflectional form is a distinct word.
Unlike English, Czech does not obey the subject-verb-object ordering of
words as in ``Peter sells cars.'' Subjects and objects are often
identified by their inflectional forms, and discourse plays a role
in syntactic analysis. Participants in this project plan to explore
techniques of syntactic analysis, both known and new, which utilize
inflectional information to deal with the free word order. The techniques
developed here for Czech newspaper text are expected to be useful for
Polish, Russian, Serbo-Croatian and other Slavic languages, and for
languages such as Spanish, German and Italian, which exhibit inflectional
and free word order behavior to a lesser degree.
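
The statistical approach described above (annotate sentences by hand, estimate
how often each analysis occurs, then rank the analyses of a new sentence) can
be illustrated with a minimal sketch. Everything in it is an assumption made
for illustration: the toy annotations in TREEBANK, the function names, and the
add-one smoothing are hypothetical, and the decision is reduced to the single
prepositional-phrase attachment from the telescope example. Real parsers score
whole trees and search them efficiently rather than enumerating two candidates.

```python
from collections import Counter

# Hypothetical hand-annotated data (not from any real treebank).
# Each record is (verb, preposition, attachment), where attachment says
# whether the prepositional phrase modifies the verb or the object noun.
TREEBANK = [
    ("saw", "with", "verb"),
    ("saw", "with", "verb"),
    ("saw", "with", "noun"),
    ("ate", "with", "verb"),
    ("bought", "of", "noun"),
]


def train(treebank):
    """Count how often each attachment occurs for a (verb, preposition) pair."""
    counts = Counter()
    for verb, prep, attachment in treebank:
        counts[(verb, prep, attachment)] += 1
    return counts


def rank_analyses(counts, verb, prep, candidates=("verb", "noun")):
    """Rank candidate analyses by add-one-smoothed relative frequency."""
    total = sum(counts[(verb, prep, c)] for c in candidates)
    scored = [
        (c, (counts[(verb, prep, c)] + 1) / (total + len(candidates)))
        for c in candidates
    ]
    # Highest estimated probability first.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


counts = train(TREEBANK)
ranking = rank_analyses(counts, "saw", "with")
best = ranking[0][0]  # the most likely analysis under this toy model
```

Because the counts are keyed on the words themselves, the sketch also shows why
a large inflected vocabulary is a problem: every distinct word form needs its
own observations before its counts become informative.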