The Web and the Word: Alternative Sources for Bilingual Text
Philip Resnik (Linguistics Department/UMIACS, College Park)
Abstract
Parallel corpora — collections of text in parallel translation — play an important role in current work on statistical models of machine translation, cross-language information retrieval, and acquisition of lexical resources for multilingual natural language processing. Unfortunately, parallel corpora may be difficult or expensive to obtain, may be too domain- or genre-specific, or may simply not exist for the language pair of interest. I will discuss two approaches to overcoming the acquisition bottleneck for parallel text. The first part of the talk will describe first steps toward using the World Wide Web as a source for parallel text, presenting a conceptually simple but effective technique for automatically identifying parallel translated documents on the Web. The second part of the talk will discuss the use of the Bible as a parallel corpus, describing the initial phase of a project investigating the use of parallel biblical text as a resource for improving multilingual optical character recognition.