Many of the foundational techniques in modern computational linguistics rest on two assumptions: (1) that many words consist of only one morpheme and (2) that techniques that work well on English and other widely used European languages will also work across the typologically diverse languages of the world. A major component of the second assumption is the notion that words (rather than morphemes) serve as the primary meaning-bearing units within a sentence.
For high-resource languages, especially those that are analytic rather than synthetic, a common approach is to treat morphologically distinct variants (such as "dog" and "dogs") as completely independent word types, rather than as inflected variants of a common root. For European languages such as Czech and Russian that have more inflectional morphology than English, some degree of morphological analysis, normalization, or stemming is sometimes attempted. Within the NLP literature, Finnish and Turkish (both agglutinative languages) are commonly held up as extreme examples of morphological complexity that challenge common modeling assumptions. Yet, when considering all of the world’s languages, Finnish and Turkish are closer to the average case in terms of synthesis. The assumptions listed above commonly fail even for moderately challenging languages such as Finnish and Turkish. They are fatally flawed for many low-resource languages, and especially for polysynthetic languages such as those in the Inuit-Yupik language family.
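The cost of treating inflected variants as independent word types can be illustrated with a small sketch. The suffix list and the `toy_segment` function below are assumptions made purely for illustration, using English toy data; they are not a real morphological analyzer and do not reflect any particular system.

```python
from collections import Counter

# Toy suffix list, assumed purely for illustration.
TOY_SUFFIXES = ("ing", "ed", "s")


def toy_segment(word):
    """Split a word into [stem, +suffix] using the toy suffix list."""
    for suffix in TOY_SUFFIXES:
        # Require a stem of at least three characters before stripping.
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return [word[: -len(suffix)], "+" + suffix]
    return [word]


def count_units(tokens, segment=None):
    """Count word types, or morpheme types when a segmenter is supplied."""
    counts = Counter()
    for token in tokens:
        units = segment(token) if segment else [token]
        counts.update(units)
    return counts


tokens = "dog dogs walk walks walked walking".split()

# Word-level view: every surface form is its own type, each seen once.
word_types = count_units(tokens)

# Morpheme-level view: variants pool their statistics under a shared stem.
morph_types = count_units(tokens, toy_segment)

print(len(word_types), len(morph_types), morph_types["walk"])
```

In the word-level view each of the six forms is observed only once, whereas in the morpheme-level view the four forms of "walk" contribute to a single stem. The more morphemes a language packs into each word, the sparser the word-level statistics become.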
To address these shortcomings, we propose a common neural modeling architecture for one low-resource polysynthetic language family, Inuit-Yupik, which ranges geographically from Greenland through Canada and Alaska to far eastern Russia. The languages in this family, including Kalaallisut, Inuktitut, Inupiaq, and St. Lawrence Island Yupik, are extraordinarily challenging from a computational perspective, with pervasive use of derivational morphemes in addition to rich sets of inflectional suffixes and phonological challenges at morpheme boundaries.
Finite-state morphological analyzers (with varying degrees of coverage) have been developed for three varieties of Inuit and one variety of Yupik (Chen and Schwartz, 2018). Other existing tools include a baseline machine translation (MT) system for Inuktitut (Micher, 2018) and two basic spell-checkers. Digitized dictionaries exist for some languages, including a comparative dictionary of the Alaskan varieties. Relatively small monolingual and bilingual corpora exist, many in printed form only. The largest digitized corpus is the Inuktitut-English proceedings of the Legislative Assembly of Nunavut.
We expect that our work will require novel neural architectures that explicitly model characters and morphemes in addition to words and sentences. We plan to make explicit use of existing finite-state resources and the existing Inuit-Yupik comparative dictionary to develop linguistically informed neural architectures that reflect the historical and typological relations between languages. For example, we might wish to share parameters for cognates between languages that are quite close to one another, while sharing only more abstract representations (such as those that encode syntax) between more distantly related languages. We will use predictive text completion (including in combination with machine translation) as a practical end-use application.
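One way to picture parameter sharing for cognates is an embedding table in which morphemes from different languages that belong to the same cognate set resolve to a single shared vector, while all other morphemes keep language-specific vectors. The following is a minimal sketch under stated assumptions, not the proposed architecture: the class name `SharedCognateEmbeddings`, the language codes, and the morpheme strings are hypothetical placeholders, and the cognate sets are assumed to be supplied externally (for instance, derived from a comparative dictionary).

```python
import random


class SharedCognateEmbeddings:
    """Embedding lookup where cognate morphemes share one vector.

    Hypothetical illustration only; cognate_sets maps a set id to a list
    of (language, morpheme) pairs judged to be cognate.
    """

    def __init__(self, cognate_sets, dim=4, seed=0):
        self.dim = dim
        self._rng = random.Random(seed)
        # Map each (language, morpheme) pair to its shared cognate key.
        self._key_of = {}
        for set_id, members in cognate_sets.items():
            for lang, morpheme in members:
                self._key_of[(lang, morpheme)] = ("cognate", set_id)
        self._table = {}

    def lookup(self, lang, morpheme):
        """Return the vector for a morpheme, creating it on first use."""
        # Morphemes outside every cognate set get a language-private key.
        key = self._key_of.get((lang, morpheme), ("private", lang, morpheme))
        if key not in self._table:
            self._table[key] = [
                self._rng.gauss(0.0, 0.1) for _ in range(self.dim)
            ]
        return self._table[key]


# Placeholder languages and forms, not real Inuit-Yupik data.
cognate_sets = {"set_1": [("lang_a", "form_x"), ("lang_b", "form_y")]}
emb = SharedCognateEmbeddings(cognate_sets)

shared_a = emb.lookup("lang_a", "form_x")
shared_b = emb.lookup("lang_b", "form_y")
private_a = emb.lookup("lang_a", "form_z")
private_b = emb.lookup("lang_b", "form_z")

print(shared_a is shared_b, private_a is private_b)
```

Because the two cognate forms resolve to the same list object, any update made while training on one language would be visible to the other, whereas the non-cognate forms remain independent. For more distantly related languages, the same idea could be applied only to higher-level components rather than to morpheme vectors.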