Exploring Retraining-free Speech Recognition for Intra-sentential Code-switching

Huang, Zhen; Zhuang, Xinhua; Liu, Daben; Xiao, Xiaoqiang; Zhang, Yuchen; Siniscalchi, Sabato Marco

doi:10.1109/icassp.2019.8682478

Cited by 5 publications (2 citation statements)

References 20 publications

“…They looked at phone-merging techniques to handle the two languages in acoustic modeling, explored further in [9,31], and generating codeswitched text data for language modeling, studied more in [32,39]. Since then, different approaches have been applied to improve codeswitched speech recognition like speech chains [40], transliteration [41], and translation [42]. Authors in [14,27,43] focus on tracking the language switch points, similar to our LID aware training.…”

Section: Relation To Prior Work (mentioning)

confidence: 99%

Transformer-Transducers for Code-Switched Speech Recognition

Dalmia, Liu, Ronanki, et al. 2021

ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

We live in a world where 60% of the population can speak two or more languages fluently. Members of these communities constantly switch between languages when having a conversation. As automatic speech recognition (ASR) systems are being deployed to the real world, there is a need for practical systems that can handle multiple languages both within an utterance or across utterances. In this paper, we present an end-to-end ASR system using a transformer-transducer model architecture for code-switched speech recognition. We propose three modifications over the vanilla model in order to handle various aspects of code-switching. First, we introduce two auxiliary loss functions to handle the low-resource scenario of code-switching. Second, we propose a novel mask-based training strategy with language ID information to improve the label encoder training towards intra-sentential code-switching. Finally, we propose a multi-label/multi-audio encoder structure to leverage the vast monolingual speech corpora towards code-switching. We demonstrate the efficacy of our proposed approaches on the SEAME dataset, a public Mandarin-English code-switching corpus, achieving a mixed error rate of 18.5% and 26.3% on the test_man and test_sge sets respectively.

“…We also apply the multi-graph decoding strategy [17], which contains code-switch and English 1 and the code-switch dataset as Table 2, and the English is trained by the transcript of librispeech. Besides, we implement the methods that optimizes the NGram language model in [7] and [18] as the baseline of the first-pass decoding, and these two methods together achieve 7% relative WER reduction. The number of NBEST is set to 128.…”

Section: Setup (mentioning)

confidence: 99%

Code-Switch Speech Rescoring with Monolingual Data

Liu, Cao 2021

ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

In the automatic speech recognition (ASR) system, how to solve the problem of code-switch speech recognition has been a concern. Code-switch speech recognition is challenging due to data scarcity as well as diverse syntactic structures across languages. In this paper, we focus on the code-switch speech recognition in mainland China, which is obviously different from the Hong Kong and Southeast Asia area in linguistic characteristics. We propose a novel approach that only uses monolingual data for code-switch second-pass speech recognition which is also named language model rescoring. The approach converts the code-switch sentence to a monolingual sentence by a word mapping and language model determination step, therefore the issue of data scarcity is unnecessary to be considered. The word pairs during the word mapping step are generated by a fine-designed generation process that incorporates machine translation, word alignment, etc. We show that the proposed approach achieves an over 7.23% relative WER reduction from the naive monolingual language model (MLM) rescoring in our test set.