Nate Robinson (JHU) “NLP for Related Languages”
3400 N CHARLES ST
Baltimore
MD 21218
Abstract
In the age of data- and capital-driven machine learning, the gap between technological advancements for high- and low-resource language varieties keeps growing, leaving many of those with the greatest need for language technologies without access to them. Because languages are interrelated through contact and ancestry, technologies for low-resource languages could benefit from the data and models of their high-resource relatives. However, language relatedness is not a one-dimensional measure, and language relations that seem helpful for a given technology may not actually be. I present here explorations of leveraging language relatedness along the axes of phonology, morphosyntax, acoustics, and phylogenetics. Findings suggest that morphosyntactic relatedness between transfer languages is helpful for machine translation, that phonological information is helpful in some language processing applications, that cross-lingual data augmentation can assist low-resource speech technologies, and that algorithms modeling linguistic change can assist inquiries in computational linguistics. These findings have implications for groups of related low-resource languages such as Creole languages of the African diaspora, Arabic language varieties, and Sinitic languages.