Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2018
DOI: 10.18653/v1/n18-1090

Universal Dependency Parsing for Hindi-English Code-Switching

Abstract: Code-switching is a phenomenon of mixing the grammatical structures of two or more languages under varied social constraints. Code-switching data differ so radically from the benchmark corpora used in the NLP community that applying standard technologies to them sharply degrades performance. Unlike standard corpora, these data often need to go through additional processes such as language identification, normalization, and/or back-transliteration for efficient processing. In this paper, …
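As a rough illustration of those extra steps, the toy sketch below chains token-level language identification, normalization, and back-transliteration; the lookup tables, helper names, and example tokens are invented for illustration and are not taken from the paper.

```python
# Toy sketch of the preprocessing the abstract mentions: token-level language
# identification, spelling normalization, and back-transliteration of
# Roman-script Hindi to Devanagari. The tiny tables below are invented
# illustrations, not resources from the paper.

ROMAN_TO_DEVANAGARI = {"mera": "मेरा", "dost": "दोस्त", "hai": "है"}
SPELLING_FIXES = {"frnd": "friend", "gr8": "great"}

def identify_language(token):
    """Toy language ID: call a token Hindi if it is in the small Roman-Hindi table."""
    return "hi" if token.lower() in ROMAN_TO_DEVANAGARI else "en"

def preprocess(tokens):
    out = []
    for tok in tokens:
        tok = SPELLING_FIXES.get(tok.lower(), tok)      # normalization
        if identify_language(tok) == "hi":
            tok = ROMAN_TO_DEVANAGARI[tok.lower()]      # back-transliteration
        out.append(tok)
    return out

print(preprocess(["mera", "frnd", "gr8", "hai"]))
# -> ['मेरा', 'friend', 'great', 'है']
```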

Cited by 37 publications (30 citation statements)
References 28 publications
“…We test M-BERT on the CS Hindi/English UD corpus from Bhat et al. (2018), which provides texts in two formats: transliterated, where Hindi words are written in Latin script, and corrected, where annotators have converted them back to Devanagari script. Table 6 shows the results for models fine-tuned using a combination of monolingual Hindi and English, and using the CS training set (both fine-tuning on the script-corrected version of the corpus as well as the transliterated version).…”
Section: Code Switching and Transliteration
confidence: 99%
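To make the corpus's two formats concrete, the snippet below (an illustration, not the cited authors' code) compares how the standard mBERT tokenizer handles the same code-switched sentence in Roman script versus Devanagari; the example sentence is invented.

```python
# Illustrative comparison of how multilingual BERT tokenizes a code-switched
# sentence in the corpus's two formats: transliterated (Hindi in Latin script)
# vs. script-corrected (Hindi restored to Devanagari). Requires the
# `transformers` package; the sentence is an invented example.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

transliterated   = "mera friend bahut accha hai"
script_corrected = "मेरा friend बहुत अच्छा है"

for name, text in [("transliterated", transliterated),
                   ("script-corrected", script_corrected)]:
    print(f"{name:16s} -> {tokenizer.tokenize(text)}")

# Roman-script Hindi tends to be split into many English-looking wordpieces,
# which is one reason results can differ between the two versions.
```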
“…We use approaches such as language modeling, transliteration, and translation to alleviate the absence of code-mixing in the data used to pre-train transformer models. Masked Language Modeling: We fine-tune mBERT on the masked language modeling objective, following Khanuja et al. (2020b), on a combination of in-domain code-mixed movie scripts and publicly available datasets by Roy et al. (2013) and Bhat et al. (2018) to obtain modified mBERT (mod-mBERT), which is then fine-tuned on the sentence-pair classification task. Transliteration: We perform token-level language identification and transliterate the detected Romanized Hindi words in CS-NLI to Devanagari script using the approach in Singh et al. (2018), to enable mBERT to better understand them.…”
Section: Addressing Code-mixing
confidence: 99%
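The masked-language-modeling step described above could be sketched roughly as follows with the `transformers` Trainer; the corpus path and hyperparameters are placeholders, not the settings used in the cited work.

```python
# Rough sketch of continuing masked-language-model training of mBERT on raw
# code-mixed text (the "mod-mBERT" idea above). File path and hyperparameters
# are placeholders, not the cited work's settings.
from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          DataCollatorForLanguageModeling,
                          LineByLineTextDataset, Trainer, TrainingArguments)

name = "bert-base-multilingual-cased"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForMaskedLM.from_pretrained(name)

# One code-mixed sentence per line (placeholder path).
train_data = LineByLineTextDataset(tokenizer=tokenizer,
                                   file_path="code_mixed_corpus.txt",
                                   block_size=128)
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer,
                                           mlm=True, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="mod-mbert",
                           num_train_epochs=1,
                           per_device_train_batch_size=16),
    data_collator=collator,
    train_dataset=train_data,
)
trainer.train()
model.save_pretrained("mod-mbert")  # later fine-tuned on the downstream task
```

Fine-tuning on the downstream sentence-pair task would then start from the saved "mod-mbert" checkpoint instead of the original mBERT weights.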
“…We use the datasets released by Dhar et al. (2018) and Srivastava and Singh (2020); statistics of the datasets are provided in Table 1. Since both datasets contain Hindi words in Roman script, we use the CSNLI library (Bhat et al., 2017, 2018) as a preprocessing step. It transliterates the Hindi words to Devanagari and also performs text normalization.…”
Section: Data Preparation
confidence: 99%
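A data-preparation pass like the one described above might be wired up as in this sketch; `csnli_preprocess` is a hypothetical stand-in for the CSNLI call (its real interface is not shown here), and the file names are invented placeholders.

```python
# Sketch of the preprocessing pass described above: run every Roman-script
# utterance through transliteration/normalization before training.
# `csnli_preprocess` is a hypothetical stand-in for the CSNLI toolkit call;
# the identity body and the file names are placeholders.
import csv

def csnli_preprocess(text):
    """Placeholder: transliterate Roman Hindi to Devanagari and normalize."""
    return text

with open("raw_pairs.tsv", encoding="utf-8") as fin, \
     open("clean_pairs.tsv", "w", encoding="utf-8", newline="") as fout:
    reader = csv.reader(fin, delimiter="\t")
    writer = csv.writer(fout, delimiter="\t")
    for sentence, label in reader:
        writer.writerow([csnli_preprocess(sentence), label])
```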