Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics 2020
DOI: 10.18653/v1/2020.acl-main.329

GLUECoS: An Evaluation Benchmark for Code-Switched NLP

Abstract: Code-switching is the use of more than one language in the same conversation or utterance. Recently, multilingual contextual embedding models, trained on multiple monolingual corpora, have shown promising results on cross-lingual and multilingual tasks. We present an evaluation benchmark, GLUECoS, for code-switched languages, that spans several NLP tasks in English-Hindi and English-Spanish. Specifically, our evaluation benchmark includes Language Identification from text, POS tagging, Named Entity Recognition…

Cited by 71 publications (94 citation statements)
References 22 publications
“…We use approaches such as language modeling, transliteration, and translation to alleviate the absence of code-mixing in the data used to pre-train transformer models. Masked Language Modeling: We fine-tune mBERT on the masked language modeling objective, following Khanuja et al (2020b), on a combination of in-domain code-mixed movie scripts and publicly available datasets by Roy et al (2013) and Bhat et al (2018) to obtain modified mBERT (mod-mBERT) to be fine-tuned on the sentence-pair classification task. Transliteration: We perform token-level language identification and transliterate the detected Romanized Hindi words in CS-NLI to Devanagari script using the approach in Singh et al (2018), to enable mBERT to better understand them.…”
Section: Addressing Code-mixing (mentioning)
confidence: 99%
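
The masked-language-modeling fine-tuning described in the quote above can be sketched with the Hugging Face Transformers library. This is a minimal illustration, not the cited authors' code: the corpus file name, output directory, and training hyperparameters below are assumptions made for the example.

```python
# Minimal sketch of continued MLM pre-training of mBERT on code-mixed text,
# in the spirit of the fine-tuning step quoted above. The file name
# "code_mixed_corpus.txt" and the hyperparameters are illustrative assumptions.
from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-multilingual-cased")

# Hypothetical plain-text file with one code-mixed sentence per line.
raw = load_dataset("text", data_files={"train": "code_mixed_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

tokenized = raw.map(tokenize, batched=True, remove_columns=["text"])

# Dynamic masking with the standard 15% probability used for BERT-style MLM.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="mod-mbert",  # name mirrors the "mod-mBERT" in the quote
        per_device_train_batch_size=16,
        num_train_epochs=1,
        learning_rate=5e-5,
    ),
    train_dataset=tokenized["train"],
    data_collator=collator,
)
trainer.train()

model.save_pretrained("mod-mbert")
tokenizer.save_pretrained("mod-mbert")
```

Continued MLM training on code-mixed text only adapts the encoder's weights; the saved checkpoint is then fine-tuned separately on the downstream sentence-pair classification task, as the quote describes.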
“…Many works have attempted to model code-switching text and speech from a statistical perspective (Garg et al, 2018a,b). Recent works and benchmarks such as Linguistic Code-switching Evaluation (LinCE) (Aguilar et al, 2020) and GLUECoS (Khanuja et al, 2020) have provided a unified platform to evaluate CS data for various NLP tasks across various language pairs. Our work is in line with these recent efforts to provide NLP capabilities to users with diverse linguistic backgrounds.…”
Section: Code-switching Strategies (mentioning)
confidence: 99%
“…Sinha and Thakur (2005) presented a rule-based machine translation system to translate the code-mixed Hindi-English sentence to monolingual Hindi and English forms. Khanuja et al (2020) presented an evaluation benchmark for the two code-mixed language pairs (English-Hindi and English-Spanish). The proposed evaluation benchmark has six NLP tasks, i.e., language identification, POS tagging, named entity recognition, sentiment analysis, question answering, and natural language inference.…”
Section: Introduction (mentioning)
confidence: 99%
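
Three of the six tasks listed above (language identification, POS tagging, and named entity recognition) are word-level labeling problems, so a multilingual encoder is typically evaluated on them with a token-classification head. The sketch below is an illustration under that assumption; the label set and example sentence are hypothetical, and this is not the benchmark's official evaluation code.

```python
# Minimal sketch of fine-tuning mBERT with a token-classification head, as one
# would for a GLUECoS-style LID/POS/NER task. Labels and the example sentence
# are hypothetical; this is not the official GLUECoS evaluation script.
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer

labels = ["EN", "HI", "OTHER"]  # illustrative word-level language-ID tags
label2id = {l: i for i, l in enumerate(labels)}

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-multilingual-cased", num_labels=len(labels)
)

# Hypothetical code-switched sentence with one gold tag per word.
words = ["yaar", "this", "movie", "was", "bahut", "achhi"]
tags = ["HI", "EN", "EN", "EN", "HI", "HI"]

enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")

# Align word-level tags to subword tokens: label only the first subword of
# each word; special tokens and continuation pieces get -100 (ignored in loss).
aligned, prev = [], None
for word_id in enc.word_ids(batch_index=0):
    if word_id is None:
        aligned.append(-100)
    elif word_id != prev:
        aligned.append(label2id[tags[word_id]])
    else:
        aligned.append(-100)
    prev = word_id

outputs = model(**enc, labels=torch.tensor([aligned]))
outputs.loss.backward()  # one illustrative training step
```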