The SMarT Classifier for Arabic Fine-Grained Dialect Identification

Research studies on Machine Translation (MT) between Modern Standard Arabic (MSA) and English are abundant. However, studies on MT between Omani Arabic (OA) dialects and English are very scarce. This research study addresses the lack of an Omani dialect parallel dataset, as well as MT of OA to English. The study uses social media data from X (formerly Twitter) to build an authentic parallel text of the Omani dialects. The research presents baseline results on this dataset using Google Translate, Microsoft Translation, and Marian NMT. A taxonomy of the most common linguistic errors is used to analyze the translations made by the NMT systems and to provide insights for future improvements. Finally, transfer learning is used to adapt Marian NMT to the Omani dialect, with a significant improvement of 9.88 points in the BLEU score.

Section: Dialectical Arabic Datasets (mentioning)

confidence: 99%

Machine Translation of Omani Arabic Dialect from Social Media

Al-Kharusi, AAlAbdulsalam, 2023

“…This prompted researchers to create new DA datasets, usually targeting a limited number of specific regions or countries (Gadalla et al., 1997; Diab et al., 2010; Al-Sabbagh and Girju, 2012; Sadat et al., 2014; Harrat et al., 2014; Jarrar et al., 2016; Khalifa et al., 2016; Al-Twairesh et al., 2018; Alsarsour et al., 2018; Kwaik et al., 2018; El-Haj, 2020). This was followed by several works that introduced multi-dialectal datasets and models for region-level dialect identification (Zaidan and Callison-Burch, 2011; Bouamor et al., 2014; Meftouh et al., 2015). The initial Arabic dialect identification shared tasks were part of the VarDial workshop series, primarily utilizing transcriptions of speech broadcasts (Malmasi et al., 2016).…”

Section: Arabic Dialects (mentioning)

confidence: 99%

NADI 2023: The Fourth Nuanced Arabic Dialect Identification Shared Task

Abdul-Mageed, Elmadany, Zhang et al., 2023

We describe the findings of the fourth Nuanced Arabic Dialect Identification Shared Task (NADI 2023). The objective of NADI is to help advance state-of-the-art Arabic NLP by creating opportunities for teams of researchers to collaboratively compete under standardized conditions. It does so with a focus on Arabic dialects, offering novel datasets and defining subtasks that allow for meaningful comparisons between different approaches. NADI 2023 targeted both dialect identification (Subtask 1) and dialect-to-MSA machine translation (Subtask 2 and Subtask 3). A total of 58 unique teams registered for the shared task, of which 18 participated (with 76 valid submissions during the test phase). Among these, 16 teams participated in Subtask 1, 5 participated in Subtask 2, and 3 participated in Subtask 3. The winning teams achieved 87.27 F1 on Subtask 1, 14.76 BLEU on Subtask 2, and 21.10 BLEU on Subtask 3, respectively. Results show that all three subtasks remain challenging, thereby motivating future work in this area. We describe the methods employed by the participating teams and briefly offer an outlook for NADI.

“…(2) Translation, in which participants are asked to translate sentences into their native Arabic dialects (Ho, 2006-; Meftouh et al., 2015; Bouamor et al., 2018; Mubarak, 2018). If all the participants are asked to translate the same source sentences, then the dataset is composed of parallel sentences in various dialects.…”

Section: Dialects Sentence (mentioning)

confidence: 99%

“…MPCA: - / 5 / 3. 2,000 Egyptian Arabic sentences from a pre-existing corpus, manually translated into 4 other country-level dialects in addition to MSA.
PADIC (Meftouh et al., 2015): 5 / 4 / 2. 6,400 sentences sampled from the transcripts of recorded conversations and movie/TV shows in Algerian Arabic and manually translated into 4 other dialects and MSA.
DIAL2MSA (Mubarak, 2018): - / - / 4. Dialectal tweets manually translated into MSA.…”

Section: Dialects Sentence (mentioning)

confidence: 99%

Arabic Dialect Identification under Scrutiny: Limitations of Single-label Classification

Keleg, Magdy, 2023

Automatic Arabic Dialect Identification (ADI) of text has gained great popularity since it was introduced in the early 2010s. Multiple datasets were developed, and yearly shared tasks have been running since 2018. However, ADI systems are reported to fail in distinguishing between the micro-dialects of Arabic. We argue that the currently adopted framing of the ADI task as a single-label classification problem is one of the main reasons for that. We highlight the limitation of the incompleteness of the dialect labels and demonstrate how it impacts the evaluation of ADI systems. A manual error analysis of the predictions of an ADI system, performed by 7 native speakers of different Arabic dialects, revealed that ≈66% of the validated errors are not true errors. Consequently, we propose framing ADI as a multi-label classification task and give recommendations for designing new ADI datasets.
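To make the single-label versus multi-label point concrete, the following is a minimal Python sketch of how the two framings can score the same predictions differently. It is an illustration under stated assumptions, not the evaluation procedure used by Keleg and Magdy: the function names, the dialect codes, and the toy data are all hypothetical, and the "valid" sets stand in for the judgments native speakers might provide.

```python
from typing import List, Set


def single_label_accuracy(preds: List[str], gold: List[str]) -> float:
    """Share of predictions equal to the single gold label of each sentence."""
    return sum(p == g for p, g in zip(preds, gold)) / len(gold)


def multi_label_accuracy(preds: List[str], valid: List[Set[str]]) -> float:
    """Share of predictions that fall inside each sentence's set of valid dialects."""
    return sum(p in v for p, v in zip(preds, valid)) / len(valid)


# Hypothetical toy data (not taken from any dataset): three sentences,
# each with one dataset label and an assumed set of dialects in which
# the sentence would also be acceptable.
gold = ["EGY", "SYR", "OMN"]
valid = [{"EGY", "SDN"}, {"SYR", "LBN", "JOR"}, {"OMN"}]
preds = ["SDN", "SYR", "YEM"]

single = single_label_accuracy(preds, gold)
multi = multi_label_accuracy(preds, valid)
print(f"single-label accuracy: {single:.2f}")
print(f"multi-label accuracy:  {multi:.2f}")
```

In this toy setup the first prediction counts as an error under single-label scoring but as correct once the sentence is allowed more than one valid dialect, which is the kind of "not a true error" case the abstract describes.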