Towards Language-Universal Mandarin-English Speech Recognition

Zhang, Shiliang; Liu, Yuan; Lei, Ming; Ma, Bin; Xie, Lei

doi:10.21437/interspeech.2019-1365

Cited by 12 publications

(10 citation statements)

References 29 publications

“…Similarly, a phoneme-based modeling unit was also studied to achieve the multilingual ASR task [44]. A network sharing approach was also developed to recognize the Chinese and English languages [45]. A multi-task learning mechanism was proposed to obtain an end-to-end multilingual task in [46].…”

Section: A. ASR Related Work (mentioning)

confidence: 99%

ATCSpeechNet: A multilingual end-to-end speech recognition framework for air traffic control systems

Lin, Yang, et al. 2021

Applied Soft Computing

“…Their results showed that bytes are superior to grapheme characters over a wide variety of languages in monolingual end-to-end speech recognition. Characters are the most commonly used modeling unit for end-to-end ASR in Mandarin Chinese; sub-words have also been employed [45].…”

Section: Modeling Units in Mandarin ASR (mentioning)

confidence: 99%

Decoupling recognition and transcription in Mandarin ASR

Yuan, Cai, Gao, et al. 2021

Preprint

Much of the recent literature on automatic speech recognition (ASR) is taking an end-to-end approach. Unlike English where the writing system is closely related to sound, Chinese characters (Hanzi) represent meaning, not sound. We propose factoring audio → Hanzi into two sub-tasks: (1) audio → Pinyin and (2) Pinyin → Hanzi, where Pinyin is a system of phonetic transcription of standard Chinese. Factoring the audio → Hanzi task in this way achieves 3.9% CER (character error rate) on the Aishell-1 corpus, the best result reported on this dataset so far.
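The CER figure quoted in this abstract is a character error rate: the character-level edit distance between the recognized text and the reference, divided by the reference length. A minimal Python sketch, assuming the standard Levenshtein definition; the function names and the sample strings are illustrative and are not taken from the cited paper:

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted in characters."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference character
                curr[j - 1] + 1,     # insertion of a hypothesis character
                prev[j - 1] + cost,  # substitution, or a match when cost is 0
            ))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


# Hypothetical example: one substituted character in a six-character reference.
print(round(cer("今天天气很好", "今天天汽很好"), 3))  # 0.167
```

Because each Hanzi counts as one unit, the same function works for Mandarin transcripts without any word segmentation step.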

“…Compared to monolingual ASR with plenty of monolingual data, CS ASR is limited by hard-to-collect speech and transcriptions, especially in the era of deep learning. Therefore, reducing the demand for CS data and making full use of rich resources monolingual data have become research hotspots [13,14,15,16,17,18]. Dual-encoder structure is an effective way to make full use of two monolingual data [15,16,17,18].…”

Section: Introduction (mentioning)

confidence: 99%

Language-specific Characteristic Assistance for Code-switching Speech Recognition

Song, Xu, Meng, et al. 2022

Preprint

Dual-encoder structure successfully utilizes two language-specific encoders (LSEs) for code-switching speech recognition. Because LSEs are initialized by two pre-trained language-specific models (LSMs), the dual-encoder structure can exploit sufficient monolingual data and capture the individual language attributes. However, existing methods have no language constraints on LSEs and underutilize language-specific knowledge of LSMs. In this paper, we propose a language-specific characteristic assistance (LSCA) method to mitigate the above problems. Specifically, during training, we introduce two language-specific losses as language constraints and generate corresponding language-specific targets for them. During decoding, we take the decoding abilities of LSMs into account by combining the output probabilities of two LSMs and the mixture model to obtain the final predictions. Experiments show that either the training or decoding method of LSCA can improve the model's performance. Furthermore, the best result can obtain up to 15.4% relative error reduction on the code-switching test set by combining the training and decoding methods of LSCA. Moreover, the system can process code-switching speech recognition tasks well without extra shared parameters or even retraining based on two pre-trained LSMs by using our method.
