Parsing Arabic Dialects


The Arabic language exhibits diglossia, i.e., the coexistence of two forms of the language: a variety with a standard orthography and sociopolitical clout that is not natively spoken by anyone (Modern Standard Arabic, MSA), and varieties that are primarily spoken and lack writing standards (the Arabic dialects). A rough analogy in English is the contrast between African American dialect and Broadcast American English. The dialects and MSA form a continuum of variation at the lexical, phonological, morphological, and syntactic levels. Our project aims to discover ways of parsing Arabic dialects, i.e., of automatically determining the underlying structure of a sentence. Important resources are currently available for MSA, and there is much ongoing NLP work; for example, there are several syntactic and semantic parsers for MSA. However, resources and NLP research for the Arabic dialects are still in their infancy. Few written corpora are available for the dialects, partly because they lack standard orthographies. There are linguistic studies of Arabic dialectal syntax, but there is no language engineering work (such as computational grammars). Our approach combines the MSA resources, knowledge of the linguistics of the dialects (syntax, morphology, lexicon, phonology), and machine learning, which is used to marshal the MSA resources and the linguistic knowledge. Undergraduates on the project will be given broad exposure to linguistic and computational research while working closely with the senior members of the project on a particular problem. Knowledge of Arabic is not required, but an interest in linguistic issues is desirable.

The Center for Language and Speech Processing
The Johns Hopkins University
3400 North Charles Street, Barton Hall
Baltimore, MD 21218
Telephone: (410) 516-4237
Fax: (410) 516-5050
E-mail: clsp@clsp.jhu.edu