Parsing Arabic Dialects

Research Group of the 2005 Summer Workshop

Problem Definition: The proposed project will tackle the problem of parsing Arabic dialects. Parsing is an important component in many advanced NLP systems, and has also been shown to be useful for language modeling in automatic speech recognition (ASR). As is well known, Arabic exhibits diglossia, i.e., the coexistence of two forms of language: a high variety with standard orthography and sociopolitical clout which is not natively spoken by anyone (Modern Standard Arabic, MSA), and low varieties that are primarily spoken and lack writing standards (Arabic dialects). The dialects and MSA form a continuum of variation at the lexical, phonological, morphological, and syntactic levels.

There are important resources currently available for MSA, with much ongoing NLP work; for example, there are several syntactic and semantic parsers for MSA. However, Arabic dialect resources and NLP research are still in their infancy. There are linguistic studies of Arabic dialectal syntax, but there is no language engineering work (such as computational grammars). There are no parallel written corpora between any of the dialects and any other language, including MSA. Thus, most parsing techniques that exploit supervised (in the canonical sense) machine learning do not apply, since there is not enough annotated data to learn from. We would like to leverage existing resources and tools for MSA in order to parse Arabic dialects using both symbolic techniques and machine learning approaches.

Impact

General NLP research: We will investigate how to leverage available syntactic resources for families of resource-poor languages.
Tools: We will create standard tools, i.e., parsers with compatible tokenization and morphological analysis components, for the processing of Arabic (MSA and dialects). These can be used in applications such as dialect translation, information retrieval, information extraction from speech data, dialect transcription, language modeling for ASR, and semantic parsing of Arabic dialects.
Resources: We will create standards for the transcription of Arabic dialects, as well as grammars and small corpora and lexica.

Opening Day Presentation
Arabic NLP, Tutorial by Nizar Habash
Team Update
Tregex and Tsurgeon, Tutorial by Roger Levy
Closing Day Presentation
Final Report

Team Members
Senior Members
David Chiang	University of Maryland
Mona Diab	Columbia University
Nizar Habash	Columbia University
Rebecca Hwa	University of Pittsburgh
Owen Rambow	Columbia University
Khalil Sima'an	University of Amsterdam
Graduate Students
Roger Levy	Stanford University
Carol Nichols	University of Pittsburgh
Undergraduate Students
Vincent Lacey	Georgia Tech
Safiullah Shareef	Johns Hopkins University

Center for Language and Speech Processing