Statistical Machine Translation

Automatic translation from one human language to another using computers, better known as machine translation (MT), is a longstanding goal of computer science. To perform such a task, the computer must “know” the two languages: synonyms for words and phrases, the grammars of the two languages, and semantic or world knowledge. One way to incorporate such knowledge into a computer is to have bilingual experts hand-craft the necessary information into the computer program. Another is to let the computer learn some of these things automatically by examining large amounts of parallel text: documents that are nearly exact translations of each other. The Canadian government produces one such resource, for example, in the form of parliamentary proceedings that are recorded in both English and French.

Recently, statistical data analysis has been used to gather MT knowledge automatically from parallel bilingual text. Unfortunately, the techniques have not been disseminated to the scientific community in a very usable form, and new follow-on ideas have not developed rapidly. In pre-workshop activity, we plan to reconstruct a baseline statistical MT system for distribution to all researchers, and to use it as a platform for workshop experiments. These experiments will include working with morphology, online dictionaries, widely available monolingual texts, and syntax. The goal will be to improve the accuracy of the baseline and/or to achieve the same accuracy with only limited parallel corpora. We will work with the French-English Hansard data as well as with a new language, perhaps Czech or Chinese.

Final Report

Team Members

Senior Members
David Yarowsky (CLSP)
Kevin Knight (USC/ISI)
John Lafferty (CMU)
Dan Melamed (West Group)
David Purdy (DoD)

Graduate Students
Yaser Al-Onaizan (USC/ISI)
Jan Curin (Charles Univ., CR)
Franz Och (RWTH Aachen)

Undergraduate Students
Noah Smith (CLSP)
Michael Jahr (Stanford)

Center for Language and Speech Processing