CALCS 2021 Shared Task: Machine Translation for Code-Switched Data

Chen, Shuguang; Aguilar, Gustavo; Srinivasan, Anirudh; Diab, Mona; Solorio, Thamar

doi:10.48550/arxiv.2202.09625

Cited by 2 publications

(3 citation statements)

References 20 publications

(26 reference statements)


“…As for work on CS MT, there are many efforts (Sinha and Thakur, 2005; Dhar et al, 2018; Mahata et al, 2019; Menacer et al, 2019; Song et al, 2019; Tarunesh et al, 2021; Xu and Yvon, 2021; Chen et al, 2022; Hamed et al, 2022c). To the best of our knowledge, none of these efforts presented an extensive comparison covering different segmentation techniques.…”

Section: Related Work (mentioning)

confidence: 99%

“…We identify three main challenges for CS MT. First is data sparsity, a challenge common to many CS language pairs because of limited parallel corpora containing commissioned translations of CS text (Çetinoglu et al, 2016; Srivastava and Singh, 2020; Tarunesh et al, 2021; Hamed et al, 2022b; Chen et al, 2022). Second is Egyptian Arabic morphological richness, which further exacerbates the data sparsity situation (Habash et al, 2012a,b).…”

Section: Introduction (mentioning)

confidence: 99%


Exploring Segmentation Approaches for Neural Machine Translation of Code-Switched Egyptian Arabic-English Text

Gaser, Mager, Hamed et al. 2023

Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics


Data sparsity is one of the main challenges posed by code-switching (CS), which is further exacerbated in the case of morphologically rich languages. For the task of machine translation (MT), morphological segmentation has proven successful in alleviating data sparsity in monolingual contexts; however, it has not been investigated for CS settings. In this paper, we study the effectiveness of different segmentation approaches on MT performance, covering morphology-based and frequency-based segmentation techniques. We experiment on MT from code-switched Arabic-English to English. We provide detailed analysis, examining a variety of conditions, such as data size and sentences with different degrees of CS. Empirical results show that morphology-aware segmenters perform the best in segmentation tasks but under-perform in MT. Nevertheless, we find that the choice of the segmentation setup to use for MT is highly dependent on the data size. For extreme low-resource scenarios, a combination of frequency and morphology-based segmentations is shown to perform the best. For more resourced settings, such a combination does not bring significant improvements over the use of frequency-based segmentation.


“…Significant research efforts have been dedicated to various code-switched tasks in the field of Natural Language Processing (NLP), such as Language Identification, Named Entity Recognition (NER), POS Tagging, Sentiment Analysis, Question Answering, and Natural Language Inference (NLI) (Khanuja et al, 2020; Jose et al, 2020; Chen et al, 2022; Rizwan et al, 2020). However, there has been limited exploration in the domain of propaganda detection, particularly for low-resource languages.…”

Section: Introduction (mentioning)

confidence: 99%

Detecting Propaganda Techniques in Code-Switched Social Media Text

Salman, Hanif, Shehata et al. 2023

Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing


Propaganda is a form of communication intended to influence the opinions and the mindset of the public to promote a particular agenda. With the rise of social media, propaganda has spread rapidly, leading to the need for automatic propaganda detection systems. Most work on propaganda detection has focused on high-resource languages, such as English, and little effort has been made to detect propaganda for low-resource languages. Yet, it is common to find a mix of multiple languages in social media communication, a phenomenon known as code-switching. Code-switching combines different languages within the same text, which poses a challenge for automatic systems. Considering this premise, we propose a novel task of detecting propaganda techniques in code-switched text. To support this task, we create a corpus of 1,030 texts code-switching between English and Roman Urdu, annotated with 20 propaganda techniques at the fragment level. We perform a number of experiments contrasting different experimental setups, and we find that it is important to model the multilinguality directly rather than using translation as well as to use the right fine-tuning strategy. The code and the dataset are publicly available at https://github.com/mbzuai-nlp/propaganda-codeswitched-text

WARNING: This paper contains examples and words that are offensive in nature.
