Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics 2020
DOI: 10.18653/v1/2020.acl-main.80

Modeling Code-Switch Languages Using Bilingual Parallel Corpus

Abstract: Language modeling is the technique to estimate the probability of a sequence of words. A bilingual language model is expected to model the sequential dependency for words across languages, which is difficult due to the inherent lack of suitable training data as well as diverse syntactic structure across languages. We propose a bilingual attention language model (BALM) that simultaneously performs language modeling objective with a quasi-translation objective to model both the monolingual as well as the cross-l…
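The truncated abstract indicates that BALM trains a single model on a language-modeling objective together with a quasi-translation objective over bilingual parallel data. As an illustration only (the exact form of the quasi-translation term is not given in the excerpt, and the names below are not from the paper), such a multi-task loss can be sketched as a weighted sum of two cross-entropy terms:

```python
# Minimal sketch (assumption, not the paper's released code): a multi-task loss
# in the spirit of BALM, combining a language-modeling term with a
# quasi-translation term. `lambda_qt` is a hypothetical mixing weight.
import torch
import torch.nn.functional as F

def joint_loss(lm_logits, lm_targets, qt_logits, qt_targets, lambda_qt=1.0):
    """Sum next-word prediction loss and a quasi-translation prediction loss."""
    lm_loss = F.cross_entropy(lm_logits.view(-1, lm_logits.size(-1)), lm_targets.view(-1))
    qt_loss = F.cross_entropy(qt_logits.view(-1, qt_logits.size(-1)), qt_targets.view(-1))
    return lm_loss + lambda_qt * qt_loss
```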

Cited by 19 publications (18 citation statements) · References 36 publications

“…A sentence-encoder is needed to map the individual utterances within D onto the vector space. Firstly, we fine-tune a RoBERTa-base pre-trained language model (Liu et al., 2019) with training data of the target dialogue domain, because task-adaptive fine-tuning of the pre-trained language model on the target domain data benefits the final performance (Gururangan et al., 2020; Lee and Li, 2020). Next, the mean pooling operation is performed on the token embeddings within each utterance of D to derive their respective utterance-level representations.…”
Section: Dialogue Utterance Representation (mentioning)
confidence: 99%
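The excerpt above describes a standard recipe: fine-tune RoBERTa-base on the target dialogue domain, then mean-pool token embeddings to obtain utterance-level vectors. A minimal sketch of the pooling step, assuming the Hugging Face transformers library and leaving the domain fine-tuning aside (the checkpoint name and helper function are illustrative, not the cited authors' code):

```python
# Mean-pooled utterance embeddings from RoBERTa token states (illustrative sketch).
import torch
from transformers import RobertaTokenizer, RobertaModel

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
model = RobertaModel.from_pretrained("roberta-base")  # in practice, a domain fine-tuned checkpoint
model.eval()

def utterance_embedding(utterance: str) -> torch.Tensor:
    """Mean-pool token embeddings into a single utterance-level vector."""
    inputs = tokenizer(utterance, return_tensors="pt", truncation=True)
    with torch.no_grad():
        token_states = model(**inputs).last_hidden_state  # (1, seq_len, hidden)
    mask = inputs["attention_mask"].unsqueeze(-1)         # ignore padding positions
    summed = (token_states * mask).sum(dim=1)
    return summed / mask.sum(dim=1)                       # (1, hidden)

vec = utterance_embedding("could you book a table for two tonight?")
print(vec.shape)  # torch.Size([1, 768])
```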
“…We show that the proposed approach outperforms naive MLM rescoring (i.e., without the conversion mentioned above) by a 7.23% relative WER reduction on Mainland China Code-Switch (MLCCS). We also achieve over a 7.08% relative WER reduction compared with the Bilingual Attention Language Model (BALM) [4], which achieves state-of-the-art performance on the SEAME [15] code-switch dataset.…”
Section: Introduction (mentioning)
confidence: 91%
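The "naive MLM rescoring" baseline mentioned in this excerpt is commonly implemented as pseudo-log-likelihood scoring: mask each token of an ASR hypothesis in turn and sum the log-probabilities the masked LM assigns to the original tokens. A sketch under that assumption (the multilingual checkpoint and example hypotheses are placeholders, and the "conversion" step from the cited paper is not shown):

```python
# Pseudo-log-likelihood rescoring of ASR hypotheses with a masked LM (illustrative sketch).
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
mlm = AutoModelForMaskedLM.from_pretrained("bert-base-multilingual-cased")
mlm.eval()

def pseudo_log_likelihood(sentence: str) -> float:
    """Mask each token in turn and sum the log-prob of the original token."""
    ids = tokenizer(sentence, return_tensors="pt")["input_ids"][0]
    total = 0.0
    for pos in range(1, len(ids) - 1):            # skip [CLS] and [SEP]
        masked = ids.clone()
        masked[pos] = tokenizer.mask_token_id
        with torch.no_grad():
            logits = mlm(masked.unsqueeze(0)).logits[0, pos]
        total += torch.log_softmax(logits, dim=-1)[ids[pos]].item()
    return total

# Pick the hypothesis the masked LM prefers from an ASR n-best list.
hypotheses = ["we need to 开会 tomorrow", "we need to 开会 to marrow"]
print(max(hypotheses, key=pseudo_log_likelihood))
```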
“…It's becoming fairly common in today's globalized world, not only in bilingual societies but also in predominantly monolingual ones, and more and more speakers use a second language in professional contexts. Code-switch speech recognition poses a significant challenge [1] even as recent ASR systems reach outstanding performance [2,3,4]. It introduces more vocabulary choices at each prediction step because of the words from two languages, and it appears freely and sparingly without strict syntactic or grammatical rules.…”
Section: Introduction (mentioning)
confidence: 99%
“…The main challenge addressed in these works is the limited availability of code-mixed sentences. Gonen and Goldberg (2019) and Lee and Li (2020) propose different methods of training LMs for CM sentences without explicitly creating synthetic CM data, but another popular strategy is to first create synthetic CM data and train the LM on it. We next summarize existing approaches to generating synthetic CM data: one line of work proposes to learn switching patterns from code-mixed data using GAN-based adversarial training.…”
Section: Related Work (mentioning)
confidence: 99%