Sebastian Nehrdich (Berkeley) “MITRA: Beyond Just Machine Translation for Premodern Asian Low Resource Languages”

When:
October 25, 2024 @ 12:00 pm – 1:15 pm
Where:
Hackerman Hall B17
3400 N CHARLES ST
Baltimore
MD 21218
Cost:
Free

Abstract

Recent years have seen the rise of multilingual language models that achieve high levels of performance on a large number of tasks, with some of them handling hundreds of languages at once. Premodern languages are usually underrepresented in such models, leading to poor performance in downstream applications. In my talk, I will introduce the Dharmamitra project, which aims to develop a diverse set of language models to address these shortcomings for the classical Asian low-resource languages Sanskrit, Tibetan, Classical Chinese, and Pali. These models provide solutions for low-level NLP tasks such as word segmentation and morpho-syntactic tagging, as well as high-level tasks such as semantic search, machine translation, and general chatbot interaction. I will talk about the individual challenges and unique characteristics of the data involved, and the strategies we deploy to address them. I will also demonstrate how these different tools can be combined in an application that goes beyond simple sentence-to-sentence machine translation and instead provides detailed grammatical explanations and corpus-wide search, giving users as much relevant information as possible. This application is helpful both for early-stage language learners and for experienced researchers with a high level of language knowledge and very specific demands.

Biography

Sebastian Nehrdich is a research assistant at the University of California, Berkeley, where he is leading the MITRA project together with Prof. Kurt Keutzer. He is also currently completing a PhD degree in computational linguistics at the University of Duesseldorf, Germany. His main research interest is the application of NLP methods to premodern Asian languages, with a focus on Sanskrit, Pali, Tibetan, and Classical Chinese. He holds a master’s degree in Buddhist Studies and has worked extensively with literature in premodern languages while also publishing articles on NLP problems for these languages at major conferences. Since 2018, he has been the chief developer of the BuddhaNexus platform, a website that provides an interactive and user-friendly interface for exploring instances of textual reuse in large text collections of the Buddhist tradition, with over a hundred million intertextual links. Since 2023, the MITRA project has provided grammatical analyzers, machine translation, and semantic search capabilities for classical Asian languages, and it currently serves between 500 and 1,000 users per day, primarily from academic backgrounds.

Center for Language and Speech Processing