Proceedings of the Fourth Arabic Natural Language Processing Workshop 2019
DOI: 10.18653/v1/w19-4629

Mawdoo3 AI at MADAR Shared Task: Arabic Tweet Dialect Identification

Abstract: Arabic dialect identification is an inherently complex problem, as Arabic dialect taxonomy is convoluted and aims to dissect a continuous space rather than a discrete one. In this work, we present machine learning and deep learning approaches to predict 21 fine-grained dialects from a set of given tweets per user. We adopted numerous feature extraction methods, most of which showed improvement in the final model, such as word embeddings, TF-IDF, and other tweet features. Our results show that a simple LinearSVC can outpe…

Citations: cited by 11 publications (8 citation statements)
References: 10 publications
“…The outcome demonstrates the BilLex application's effectiveness in obtaining the cross-lingual equivalents of words and sentences in other languages (Shi et al 2019). As part of the Multi-Arabic Dialect Applications and Resources (MADAR) shared challenge, LSTM with fastText predicts the Arabic dialect from a collection of Arabic tweets with an accuracy of 50.59% (Talafha et al 2019). Urdu is a low-resource language that needs a framework for interpretable subject modeling.…”
Section: Topic Modelling (mentioning)
confidence: 94%
“…However, the field suffers from fragmented and independent works on different corpora that vary in terms of granularity, size and domain, making it challenging to track the progress of the solutions. Early work focused on binary dialect classification by discriminating one dialect from MSA (Elfardy and Diab, 2013; Tillmann et al, 2014), as well as identifying Arabic dialects at both a region-level Callison-Burch, 2011, 2014; Cotterell and Callison-Burch, 2014) and a country-level (Talafha et al, 2020; Abdelali et al, 2021; AlKhamissi et al, 2021).…”
Section: Previous Work (mentioning)
confidence: 99%
“…However, it does have its limitation such as the lack of support for the various Arabic dialects. To address this, one might benefit from existing multi-dialect parallel datasets [49][50][51][52][53][54] or build new ones (perhaps, by benefiting from unsupervised approaches for dialect translation [7]). Another issue that can be addressed before adopting ATAR in real-life scenarios is trying to increase the model's accuracy.…”
Section: Experiments and Evaluation (mentioning)
confidence: 99%