Language processing for Arabic microblog retrieval

Darwish, Kareem; Magdy, Walid; Mourad, Ahmed

doi:10.1145/2396761.2398658

Cited by 70 publications

(62 citation statements)

References 5 publications

“…Tweets and user locations were normalized and cleaned in the manner described in Darwish et al (2012) by mapping frequent non-Arabic characters and decoration to their mappings, handling repeated characters, etc. Below in an example that shows a tweet before and after normalization: Before: mbrwwwwwwk yA bA$A.…”

Section: Tweet Normalization (mentioning)

confidence: 99%

Using Twitter to Collect a Multi-Dialectal Corpus of Arabic

Mubarak, Darwish

2014

Proceedings of the EMNLP 2014 Workshop on Arabic Natural Language Processing (ANLP)

Self Cite

This paper describes the collection and classification of a multi-dialectal corpus of Arabic based on the geographical information of tweets. We mapped information of user locations to one of the Arab countries, and extracted tweets that have dialectal word(s). Manual evaluation of the extracted corpus shows that the accuracy of assignment of tweets to some countries (like Saudi Arabia and Egypt) is above 93%, while the accuracy for other countries, such as Algeria and Syria, is below 70%.
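The "handling repeated characters" step named in the citation statement above can be illustrated with a minimal sketch. The rule used here (collapse any run of three or more identical characters to one) and the function name `collapse_repeats` are assumptions for illustration only; the exact rule in Darwish et al. (2012) is not given in the quoted text and may differ, and the result below is this sketch's output, not the "After" form from the citing paper.

```python
import re

# Assumption: a run of 3+ identical characters is treated as elongation
# and collapsed to a single character. The threshold is illustrative.
_ELONGATION = re.compile(r"(.)\1{2,}")


def collapse_repeats(text: str) -> str:
    """Collapse runs of three or more identical characters to one."""
    return _ELONGATION.sub(r"\1", text)


if __name__ == "__main__":
    # Input taken from the quoted "Before" example (Buckwalter transliteration).
    print(collapse_repeats("mbrwwwwwwk yA bA$A"))
```

Keeping the threshold at three leaves legitimate double letters untouched, at the cost of missing two-character elongations.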

“…We plan to release the tweet ID's and our annotations. We preprocessed the training and test sets using the method described by Darwish et al (2012), which includes performing letter and word normalizations, and segmented all data using an open-source MSA word segmentor (Darwish et al, 2012). We also removed punctuations, hashtags, and name mentions from the test set.…”

Section: Evaluation Setup (mentioning)

confidence: 99%

Verifiably Effective Arabic Dialect Identification

Darwish, Sajjad, Mubarak

2014

Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Self Cite

Several recent papers on Arabic dialect identification have hinted that using a word unigram model is sufficient and effective for the task. However, most previous work was done on a standard fairly homogeneous dataset of dialectal user comments. In this paper, we show that training on the standard dataset does not generalize, because a unigram model may be tuned to topics in the comments and does not capture the distinguishing features of dialects. We show that effective dialect identification requires that we account for the distinguishing lexical, morphological, and phonological phenomena of dialects. We show that accounting for such can improve dialect detection accuracy by nearly 10% absolute.
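The test-set cleanup mentioned in the citation statement above (removing punctuation, hashtags, and name mentions) could look roughly like the following sketch. The function name `strip_tweet_markup`, the regular expressions, and the punctuation set are assumptions chosen for illustration, not the authors' implementation. The sketch also assumes Arabic-script or plain text input; for Buckwalter-transliterated text, symbols such as `$` stand for letters and must not be stripped.

```python
import re
import string

# Assumption: a mention or hashtag is '@' or '#' followed by word characters.
_MENTION_OR_HASHTAG = re.compile(r"[@#]\w+")

# Assumption: ASCII punctuation plus the Arabic comma, semicolon and question mark.
_PUNCT = re.compile("[" + re.escape(string.punctuation + "،؛؟") + "]")


def strip_tweet_markup(text: str) -> str:
    """Remove mentions, hashtags and punctuation, then squeeze whitespace."""
    text = _MENTION_OR_HASHTAG.sub(" ", text)  # drop whole @name / #tag tokens
    text = _PUNCT.sub(" ", text)               # drop remaining punctuation
    return " ".join(text.split())              # normalize spacing


if __name__ == "__main__":
    print(strip_tweet_markup("@user hello, world! #tag"))
```

Mentions and hashtags are removed before punctuation so that the `@` and `#` markers are still present to identify the tokens to drop.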

“…-Basic normalization dataset is the normalized dataset with the basic Arabic normalization process to correct the most common Arabic misspellings. This is the same as the one that was used in [9], [26]:…”

Section: Sub-dataset (mentioning)

confidence: 99%