Proceedings of the Fifth Workshop on Computational Approaches to Linguistic Code-Switching 2021
DOI: 10.18653/v1/2021.calcs-1.7
CoMeT: Towards Code-Mixed Translation Using Parallel Monolingual Sentences

Abstract: Code-mixed languages are very popular in multilingual societies around the world, yet resources lag behind what is needed to build robust systems for such languages. A major contributing factor is the informal nature of these languages, which makes it difficult to collect code-mixed data. In this paper, we propose our system for Task 1 of CALCS 2021 to generate a machine translation system from English to Hinglish in a supervised setting. Translating in the given direction can help expand the set of resources for several …

Cited by 24 publications (13 citation statements); references 26 publications.
“…• LTRC-PreCog (Gautam et al., 2021). They propose to use mBART, a pre-trained multilingual sequence-to-sequence model, and fully utilize its pre-training by transliterating the roman Hindi words in the code-mixed sentences to Devanagari script.…”
Section: Methods
confidence: 99%
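The preprocessing idea described in this citation statement can be sketched as follows. This is a minimal illustration only: the word map below is a hypothetical stand-in, since the actual system transliterates romanized Hindi with a trained transliteration model rather than a lookup table.

```python
# Sketch of the transliteration-based preprocessing idea: convert the
# romanized Hindi tokens in a code-mixed (Hinglish) sentence into
# Devanagari, so a multilingual model like mBART sees Hindi in the
# script it was pre-trained on, while English tokens pass through.
# ROMAN_TO_DEVANAGARI is a toy, hand-written stand-in for a real
# transliteration model.

ROMAN_TO_DEVANAGARI = {
    "mujhe": "मुझे",
    "bahut": "बहुत",
    "pasand": "पसंद",
    "hai": "है",
}

def preprocess_hinglish(sentence: str) -> str:
    """Replace known romanized Hindi tokens with their Devanagari
    forms; leave English (and any unknown) tokens untouched."""
    tokens = sentence.split()
    converted = [ROMAN_TO_DEVANAGARI.get(tok.lower(), tok) for tok in tokens]
    return " ".join(converted)

print(preprocess_hinglish("mujhe machine translation bahut pasand hai"))
# मुझे machine translation बहुत पसंद है
```

The resulting mixed-script sentence would then be fed to the sequence-to-sequence model as its source input; English words are deliberately left in Latin script, since only the Hindi portion benefits from matching mBART's pre-training data.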
“…Code-switching in NLP has seen a rise of interest in recent years, including a dedicated workshop starting in 2014 (Diab et al., 2014) and still ongoing (Solorio et al., 2021). CS in machine translation also has a long history (Le Féal, 1990; Climent et al., 2003; Sinha and Thakur, 2005; Johnson et al., 2017; Elmadany et al., 2021; Xu and Yvon, 2021), but has seen a rise of interest with the advent of large multilingual models such as mBART (Liu et al., 2020) or mT5 (Xue et al., 2020; Gautam et al., 2021; Jawahar et al., 2021). Due to the lack of available CS data and the ease of single-word translation, most of these recent related MT works have synthetically created CS data for either training or testing by translating one or more of the words in a sentence (Song et al., 2019; Nakayama et al., 2019; Xu and Yvon, 2021).…”
Section: Related Work
confidence: 99%