The Web and the Word: Alternative Sources for Bilingual Text

Philip Resnik, Linguistics Department/UMIACS, University of Maryland, College Park

November 10, 1998


Parallel corpora -- collections of text in parallel translation -- play an important role in current work on statistical models of machine translation, cross-language information retrieval, and acquisition of lexical resources for multilingual natural language processing.  Unfortunately, parallel corpora may be difficult or expensive to obtain, may be too domain- or genre-specific, or may simply not exist for the language pair of interest.

I will discuss two approaches to overcoming the acquisition bottleneck for parallel text.  The first part of the talk will describe early steps toward using the World Wide Web as a source of parallel text, presenting a conceptually simple but effective technique for automatically identifying parallel translated documents on the Web.  The second part will discuss the use of the Bible as a parallel corpus, describing the initial phase of a project investigating parallel biblical text as a resource for improving multilingual optical character recognition.