Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2018
DOI: 10.18653/v1/n18-1090

Universal Dependency Parsing for Hindi-English Code-Switching

Abstract: Code-switching is a phenomenon of mixing the grammatical structures of two or more languages under varied social constraints. Code-switching data differ so radically from the benchmark corpora used in the NLP community that applying standard technologies to them sharply degrades performance. Unlike standard corpora, these data often need to go through additional processes such as language identification, normalization, and/or back-transliteration for efficient processing. In this paper, …
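As a rough illustration of those extra steps, the toy sketch below chains token-level language identification, normalization, and back-transliteration; the lookup tables, helper names, and example tokens are invented for illustration and are not taken from the paper.

```python
# Toy sketch of the preprocessing the abstract mentions: token-level language
# identification, spelling normalization, and back-transliteration of
# Roman-script Hindi to Devanagari. The tiny tables below are invented
# illustrations, not resources from the paper.

ROMAN_TO_DEVANAGARI = {"mera": "मेरा", "dost": "दोस्त", "hai": "है"}
SPELLING_FIXES = {"frnd": "friend", "gr8": "great"}

def identify_language(token):
    """Toy language ID: call a token Hindi if it is in the small Roman-Hindi table."""
    return "hi" if token.lower() in ROMAN_TO_DEVANAGARI else "en"

def preprocess(tokens):
    out = []
    for tok in tokens:
        tok = SPELLING_FIXES.get(tok.lower(), tok)      # normalization
        if identify_language(tok) == "hi":
            tok = ROMAN_TO_DEVANAGARI[tok.lower()]      # back-transliteration
        out.append(tok)
    return out

print(preprocess(["mera", "frnd", "gr8", "hai"]))
# -> ['मेरा', 'friend', 'great', 'है']
```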

Cited by 37 publications (30 citation statements)
References 28 publications
“…We test M-BERT on the CS Hindi/English UD corpus from Bhat et al. (2018), which provides texts in two formats: transliterated, where Hindi words are written in Latin script, and corrected, where annotators have converted them back to Devanagari script. Table 6 shows the results for models fine-tuned using a combination of monolingual Hindi and English, and using the CS training set (both fine-tuning on the script-corrected version of the corpus as well as the transliterated version).…”
Section: Code Switching and Transliteration
confidence: 99%
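To make the corpus's two formats concrete, the snippet below (an illustration, not the cited authors' code) compares how the standard mBERT tokenizer handles the same code-switched sentence in Roman script versus Devanagari; the example sentence is invented.

```python
# Illustrative comparison of how multilingual BERT tokenizes a code-switched
# sentence in the corpus's two formats: transliterated (Hindi in Latin script)
# vs. script-corrected (Hindi restored to Devanagari). Requires the
# `transformers` package; the sentence is an invented example.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

transliterated   = "mera friend bahut accha hai"
script_corrected = "मेरा friend बहुत अच्छा है"

for name, text in [("transliterated", transliterated),
                   ("script-corrected", script_corrected)]:
    print(f"{name:16s} -> {tokenizer.tokenize(text)}")

# Roman-script Hindi tends to be split into many English-looking wordpieces,
# which is one reason results can differ between the two versions.
```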
“…We use approaches such as language modeling, transliteration, and translation to alleviate the absence of code-mixing in the data used to pre-train transformer models. Masked Language Modeling: We fine-tune mBERT on the masked language modeling objective, following Khanuja et al. (2020b), on a combination of in-domain code-mixed movie scripts and publicly available datasets by Roy et al. (2013) and Bhat et al. (2018) to obtain modified mBERT (mod-mBERT), which is then fine-tuned on the sentence-pair classification task. Transliteration: We perform token-level language identification and transliterate the detected Romanized Hindi words in CS-NLI to Devanagari script using the approach in Singh et al. (2018), to enable mBERT to better understand them.…”
Section: Addressing Code-mixing
confidence: 99%
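The masked-language-modeling step described above could be sketched roughly as follows with the `transformers` Trainer; the corpus path and hyperparameters are placeholders, not the settings used in the cited work.

```python
# Rough sketch of continuing masked-language-model training of mBERT on raw
# code-mixed text (the "mod-mBERT" idea above). File path and hyperparameters
# are placeholders, not the cited work's settings.
from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          DataCollatorForLanguageModeling,
                          LineByLineTextDataset, Trainer, TrainingArguments)

name = "bert-base-multilingual-cased"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForMaskedLM.from_pretrained(name)

# One code-mixed sentence per line (placeholder path).
train_data = LineByLineTextDataset(tokenizer=tokenizer,
                                   file_path="code_mixed_corpus.txt",
                                   block_size=128)
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer,
                                           mlm=True, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="mod-mbert",
                           num_train_epochs=1,
                           per_device_train_batch_size=16),
    data_collator=collator,
    train_dataset=train_data,
)
trainer.train()
model.save_pretrained("mod-mbert")  # later fine-tuned on the downstream task
```

Fine-tuning on the downstream sentence-pair task would then start from the saved "mod-mbert" checkpoint instead of the original mBERT weights.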
“…We use the datasets released by Dhar et al. (2018) and Srivastava and Singh (2020); statistics of the datasets are provided in Table 1. Since both datasets contain Hindi words in Roman script, we use the CSNLI library (Bhat et al., 2017, 2018) as a preprocessing step. It transliterates the Hindi words to Devanagari and also performs text normalization.…”
Section: Data Preparation
confidence: 99%
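A data-preparation pass like the one described above might be wired up as in this sketch; `csnli_preprocess` is a hypothetical stand-in for the CSNLI call (its real interface is not shown here), and the file names are invented placeholders.

```python
# Sketch of the preprocessing pass described above: run every Roman-script
# utterance through transliteration/normalization before training.
# `csnli_preprocess` is a hypothetical stand-in for the CSNLI toolkit call;
# the identity body and the file names are placeholders.
import csv

def csnli_preprocess(text):
    """Placeholder: transliterate Roman Hindi to Devanagari and normalize."""
    return text

with open("raw_pairs.tsv", encoding="utf-8") as fin, \
     open("clean_pairs.tsv", "w", encoding="utf-8", newline="") as fout:
    reader = csv.reader(fin, delimiter="\t")
    writer = csv.writer(fout, delimiter="\t")
    for sentence, label in reader:
        writer.writerow([csnli_preprocess(sentence), label])
```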