Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics 2020
DOI: 10.18653/v1/2020.acl-main.329

GLUECoS: An Evaluation Benchmark for Code-Switched NLP

Abstract: Code-switching is the use of more than one language in the same conversation or utterance. Recently, multilingual contextual embedding models, trained on multiple monolingual corpora, have shown promising results on cross-lingual and multilingual tasks. We present an evaluation benchmark, GLUECoS, for code-switched languages, that spans several NLP tasks in English-Hindi and English-Spanish. Specifically, our evaluation benchmark includes Language Identification from text, POS tagging, Named Entity Recognition…

Cited by 71 publications (94 citation statements)
References 22 publications
“…We use approaches such as language modeling, transliteration, and translation to alleviate the absence of code-mixing in the data used to pre-train transformer models. Masked Language Modeling: We fine-tune mBERT on the masked language modeling objective, following Khanuja et al (2020b), on a combination of in-domain code-mixed movie scripts and publicly available datasets by Roy et al (2013) and Bhat et al (2018) to obtain modified mBERT (mod-mBERT) to be fine-tuned on the sentence-pair classification task. Transliteration: We perform token-level language identification and transliterate the detected Romanized Hindi words in CS-NLI to Devanagari script using the approach in Singh et al (2018), to enable mBERT to better understand them.…”
Section: Addressing Code-mixing (mentioning)
confidence: 99%
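
The masked-language-modeling fine-tuning described in the quote above can be sketched with the Hugging Face Transformers library. This is a minimal illustration, not the cited authors' code: the corpus file name, output directory, and training hyperparameters below are assumptions made for the example.

```python
# Minimal sketch of continued MLM pre-training of mBERT on code-mixed text,
# in the spirit of the fine-tuning step quoted above. The file name
# "code_mixed_corpus.txt" and the hyperparameters are illustrative assumptions.
from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-multilingual-cased")

# Hypothetical plain-text file with one code-mixed sentence per line.
raw = load_dataset("text", data_files={"train": "code_mixed_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

tokenized = raw.map(tokenize, batched=True, remove_columns=["text"])

# Dynamic masking with the standard 15% probability used for BERT-style MLM.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="mod-mbert",  # name mirrors the "mod-mBERT" in the quote
        per_device_train_batch_size=16,
        num_train_epochs=1,
        learning_rate=5e-5,
    ),
    train_dataset=tokenized["train"],
    data_collator=collator,
)
trainer.train()

model.save_pretrained("mod-mbert")
tokenizer.save_pretrained("mod-mbert")
```

Continued MLM training on code-mixed text only adapts the encoder's weights; the saved checkpoint is then fine-tuned separately on the downstream sentence-pair classification task, as the quote describes.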
“…Many works have attempted to model code-switching text and speech from a statistical perspective (Garg et al, 2018a,b). Recent works and benchmarks such as Linguistic Code-switching Evaluation (LinCE) (Aguilar et al, 2020) and GLUECoS (Khanuja et al, 2020) have provided a unified platform to evaluate CS data for various NLP tasks across various language pairs. Our work is in line with these recent efforts to provide NLP capabilities to users with diverse linguistic backgrounds.…”
Section: Code-switching Strategies (mentioning)
confidence: 99%
“…Sinha and Thakur (2005) presented a rule-based machine translation system to translate the code-mixed Hindi-English sentence to monolingual Hindi and English forms. Khanuja et al (2020) presented an evaluation benchmark for the two code-mixed language pairs (English-Hindi and English-Spanish). The proposed evaluation benchmark has six NLP tasks, i.e., language identification, POS tagging, named entity recognition, sentiment analysis, question answering, and natural language inference.…”
Section: Introduction (mentioning)
confidence: 99%
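
Three of the six tasks listed above (language identification, POS tagging, and named entity recognition) are word-level labeling problems, so a multilingual encoder is typically evaluated on them with a token-classification head. The sketch below is an illustration under that assumption; the label set and example sentence are hypothetical, and this is not the benchmark's official evaluation code.

```python
# Minimal sketch of fine-tuning mBERT with a token-classification head, as one
# would for a GLUECoS-style LID/POS/NER task. Labels and the example sentence
# are hypothetical; this is not the official GLUECoS evaluation script.
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer

labels = ["EN", "HI", "OTHER"]  # illustrative word-level language-ID tags
label2id = {l: i for i, l in enumerate(labels)}

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-multilingual-cased", num_labels=len(labels)
)

# Hypothetical code-switched sentence with one gold tag per word.
words = ["yaar", "this", "movie", "was", "bahut", "achhi"]
tags = ["HI", "EN", "EN", "EN", "HI", "HI"]

enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")

# Align word-level tags to subword tokens: label only the first subword of
# each word; special tokens and continuation pieces get -100 (ignored in loss).
aligned, prev = [], None
for word_id in enc.word_ids(batch_index=0):
    if word_id is None:
        aligned.append(-100)
    elif word_id != prev:
        aligned.append(label2id[tags[word_id]])
    else:
        aligned.append(-100)
    prev = word_id

outputs = model(**enc, labels=torch.tensor([aligned]))
outputs.loss.backward()  # one illustrative training step
```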