|
|
Automatic translation from one human language to another by computer,
better known as machine translation (MT), is a longstanding goal of
computer science. To perform this task, the computer must "know" the
two languages: synonyms for words and phrases, the grammars of both
languages, and semantic or world knowledge. One way to incorporate such
knowledge into a computer is to have bilingual experts hand-craft the
necessary information into the program. Another is to let the computer
learn some of this knowledge automatically by examining large amounts
of parallel text: documents that are nearly exact translations of each
other. The Canadian government, for example, produces one such resource
in the form of parliamentary proceedings, which are recorded in both
English and French.
Recently, statistical data analysis has been used to gather MT
knowledge automatically from parallel bilingual text. Unfortunately,
these techniques have not been disseminated to the scientific community
in a readily usable form, and follow-on ideas have been slow to develop.
In pre-workshop activity, we plan to reconstruct a baseline statistical
MT system for distribution to all researchers and to use it as a platform
for workshop experiments. These experiments will include work with
morphology, online dictionaries, widely available monolingual texts,
and syntax. The goal is to improve the accuracy of the baseline, to
achieve the same accuracy with only limited parallel corpora, or both.
We will work with the French-English Hansard data as well as with a new
language, perhaps Czech or Chinese.
|