The Arabic language exhibits diglossia, i.e., the coexistence of two forms of the language: a variety with a standard orthography and sociopolitical clout that no one
speaks natively (Modern Standard Arabic, MSA), and varieties that are primarily spoken and lack writing standards (the Arabic dialects). To give an analogy from English, the
contrast is similar to that between African American dialect and Broadcast American English. The dialects and MSA form a continuum of variation at the lexical,
phonological, morphological, and syntactic levels. Our project aims at discovering ways of parsing Arabic dialects, i.e., of automatically determining the underlying
structure of a sentence. Important resources are currently available for MSA, and much NLP work on it is ongoing; for example, there are several syntactic and semantic parsers
for MSA. However, Arabic dialect resources and NLP research are still in their infancy. Few written corpora are available for the dialects, partly because of the
lack of standard orthographies. There are linguistic studies of Arabic dialectal syntax, but there is no language engineering work (such as computational grammars). Our
approach combines the MSA resources, knowledge of the linguistics of the dialect (syntax, morphology, lexicon, phonology), and machine learning, which is used to marshal the MSA resources
and the linguistic knowledge. The undergraduates on the project will be given broad exposure to linguistic and computational research while working closely on a
particular problem with the senior members of the project. Knowledge of Arabic is not required, but interest in linguistic issues is desirable.