Translingual Fine-grained Morphosyntactic Analysis
and its Application to Machine Translation
English and a small set of other languages have a wealth of available linguistic knowledge resources and
annotated language data, but the great majority of the world's languages have little or none. This dissertation
describes work which leverages the detailed and accurate morphosyntactic analyses available for English to improve
analytical capabilities for a diverse set of other languages. The work includes the targeted enrichment of English
morphosyntactic analysis, translingual projection of that analysis to bootstrap analyses of other languages, and
exploitation of that richer feature space for improved machine translation and bitext word alignment. Emphasis is
on the combination of multiple sources of information, including both explicitly expressed human linguistic
knowledge and patterns observed in monolingual and bilingual corpora, and on language pairs where advanced analysis
capabilities are available for one language and unavailable for the other.
Selected contributions to science described in this dissertation include:
- Proposal and execution of the concept of tagging English with a quasi-universal part-of-speech tagset of
fine-grained morphosyntactic features designed for effective translingual annotation transfer from English to a
diverse set of world languages.
- Demonstration of the feasibility of automatically tagging English with a quasi-universal part-of-speech
tagset with high accuracy, including the large percentage of quasi-universal features which are not realized
via surface English morphology.
- Demonstration of the high-performance extraction of fine-grained morphosyntactic tags from several
state-of-the-art parsers, the combination of which outperforms the syntactic analysis extracted from any
- Demonstration of successful fine-grained tagset mapping between languages to enable translingual projection
between non-isomorphic fine-grained tagsets.
- Demonstration of successful bootstrapping from this projection, using automatically trained system
combination to integrate multiple information sources.
- Demonstration that enrichment of conditioning for machine translation by inclusion of fine-grained
morphosyntactic tagging can provide significant gains in the accuracy of lexical choice in machine translation.
- Demonstration that morphological expansion of a translation lexicon can provide significant improvements in
- Demonstration that such expansion followed by weighting or filtering by empirically estimated correspondences
between source- and target-language inflectional forms can improve translation performance.
- Demonstration that syntactically transforming the target language, by reordering parsed English into an
English′ that closely parallels the source-language word order, can provide substantial improvements in
word-alignment performance.
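
As a rough illustration of the translingual projection idea named in the contributions above, the following Python sketch copies English tags onto target-language tokens through word-alignment links, passing each tag through a tagset mapping first. It is a minimal sketch under stated assumptions, not the dissertation's implementation: the function name `project_tags`, the tag strings, and the majority-vote rule for tokens with several links are all hypothetical choices made for this example.

```python
from collections import Counter, defaultdict


def project_tags(english_tags, alignment, tag_map, target_length):
    """Project English tags onto target tokens through alignment links.

    english_tags  : list of tags, one per English token
    alignment     : iterable of (english_index, target_index) link pairs
    tag_map       : dict from English tag to target-language tag
                    (hypothetical mapping between non-isomorphic tagsets)
    target_length : number of tokens in the target sentence
    """
    votes = defaultdict(Counter)
    for e_idx, t_idx in alignment:
        mapped = tag_map.get(english_tags[e_idx])
        if mapped is not None:  # skip English tags with no target equivalent
            votes[t_idx][mapped] += 1

    # Assumption: a target token takes its most frequently projected tag;
    # unaligned tokens are left untagged (None).
    projected = []
    for t_idx in range(target_length):
        if votes[t_idx]:
            projected.append(votes[t_idx].most_common(1)[0][0])
        else:
            projected.append(None)
    return projected


# Toy example with invented tags: two aligned tokens, one unaligned token.
example = project_tags(
    english_tags=["DET", "NOUN-PL"],
    alignment=[(0, 0), (1, 1)],
    tag_map={"DET": "D", "NOUN-PL": "N-pl"},
    target_length=3,
)
```

Tokens left as `None` are where a bootstrapping step of the kind described in the list would have to draw on other information sources; how that is actually done is not specified here.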