Statistical Machine Translation

Research Group of the 1999 Summer Workshop

Automatic translation from one human language to another by computer, better known as machine translation (MT), is a longstanding goal of computer science. To perform such a task, the computer must “know” the two languages: synonyms for words and phrases, the grammars of both languages, and semantic or world knowledge. One way to give a computer this knowledge is to have bilingual experts hand-craft the necessary information into the program. Another is to let the computer learn some of it automatically by examining large amounts of parallel text, that is, documents which are nearly exact translations of each other. The Canadian government, for example, produces one such resource in the form of parliamentary proceedings, which are recorded in both English and French.

Recently, statistical data analysis has been used to gather MT knowledge automatically from parallel bilingual text. Unfortunately, the techniques have not been disseminated to the scientific community in a readily usable form, and follow-on ideas have been slow to develop. In pre-workshop activity, we plan to reconstruct a baseline statistical MT system for distribution to all researchers, and to use it as a platform for workshop experiments. These experiments will include working with morphology, online dictionaries, widely available monolingual texts, and syntax. The goal will be to improve the accuracy of the baseline, to achieve the same accuracy with only limited parallel corpora, or both. We will work with the French-English Hansard data as well as with a new language, perhaps Czech or Chinese.
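
As a rough illustration of how translation knowledge can be gathered from parallel text, the sketch below estimates word translation probabilities with expectation-maximization, in the style of IBM Model 1. This choice of model is an assumption made for illustration; the description above does not say which models the baseline system uses. The function names (train_model1, best_translation) and the three-sentence toy corpus are hypothetical.

```python
from collections import defaultdict


def train_model1(pairs, iterations=10):
    """Estimate t(f | e) by EM from (english_tokens, french_tokens) pairs.

    Illustrative sketch only: no smoothing, no alignment or fertility model.
    """
    french_vocab = {f for _, fr in pairs for f in fr}
    uniform = 1.0 / len(french_vocab)
    # t[(f, e)] is the probability that English word e translates to f.
    t = defaultdict(lambda: uniform)

    for _ in range(iterations):
        count = defaultdict(float)  # expected count of (f, e) links
        total = defaultdict(float)  # expected count of links from e

        # E-step: distribute each French word over the English words
        # in the same sentence pair, in proportion to the current t.
        for en, fr in pairs:
            en_with_null = ["NULL"] + en  # NULL absorbs unaligned words
            for f in fr:
                z = sum(t[(f, e)] for e in en_with_null)
                for e in en_with_null:
                    share = t[(f, e)] / z
                    count[(f, e)] += share
                    total[e] += share

        # M-step: renormalise expected counts into probabilities.
        t = defaultdict(
            float, {(f, e): c / total[e] for (f, e), c in count.items()}
        )
    return t


def best_translation(t, english_word):
    """Return the French word with the highest t(f | english_word)."""
    candidates = {f: p for (f, e), p in t.items() if e == english_word}
    return max(candidates, key=candidates.get)


# Hypothetical toy parallel corpus (not the Hansard data).
pairs = [
    (["the", "house"], ["la", "maison"]),
    (["the", "book"], ["le", "livre"]),
    (["a", "book"], ["un", "livre"]),
]
t = train_model1(pairs)
print(best_translation(t, "book"))
```

Because "livre" co-occurs with "book" in two sentence pairs while its competitors appear only once, the expected counts shift toward that pairing with each iteration. The same loop applies unchanged to a much larger corpus, although a practical system would need far more than this.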

Final Report

Team Members
Senior Members
David Yarowsky	CLSP
Kevin Knight	USC/ISI
John Lafferty	CMU
Dan Melamed	West Group
David Purdy	DoD
Graduate Students
Yaser Al-Onaizan	USC/ISI
Jan Curin	Charles Univ., CR
Franz Och	RWTH Aachen
Undergraduate Students
Noah Smith	CLSP
Michael Jahr	Stanford

Center for Language and Speech Processing