LAE: Language-Aware Encoder for Monolingual and Multilingual ASR

Tian, Jinchuan; Yu, Jianyong; Zhang, Chunlei; Weng, Chao; Zou, Yuexian; Yu, Dong

doi:10.21437/interspeech.2022-923

Cited by 8 publications

(3 citation statements)

References 0 publications

“…What should be the behavior of the monolingual Mandarin module $p(Z^M \mid X)$ when encountering a segment of English speech and vice versa? Monolingual modules in prior works [29][30][31] determine each label-to-frame alignment $z_t^{M/E}$ by first determining the language identity of each speech frame $\mathrm{LID}(x_t)$ [34]. If the speech frame $x_t$ is from a foreign language then the module will ignore it by emitting a special <NULL> token, otherwise it will transcribe using its monolingual vocabulary.…”

Section: Modeling $p(Z^{M/E} \mid X)$ With Language Segmentation (mentioning)

confidence: 99%
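The language-segmentation behavior described in the statement above can be illustrated with a minimal sketch. It assumes per-frame language identities and per-frame best labels are already available as plain lists; the function name `monolingual_alignment`, the language tags "M" and "E", and the toy inputs are hypothetical and do not come from the cited papers.

```python
NULL = "<NULL>"  # special token emitted for frames of a foreign language


def monolingual_alignment(frame_lids, frame_labels, module_lang):
    """Return one label per frame for a single monolingual module.

    frame_lids   -- language identity of each frame, e.g. "M" or "E" (assumed given)
    frame_labels -- the label each frame would receive if it were transcribed
    module_lang  -- the language this module is responsible for
    """
    if len(frame_lids) != len(frame_labels):
        raise ValueError("frame_lids and frame_labels must have the same length")
    # Keep the label when the frame belongs to this module's language,
    # otherwise ignore the frame by emitting <NULL>.
    return [
        label if lid == module_lang else NULL
        for lid, label in zip(frame_lids, frame_labels)
    ]


# Toy code-switched utterance: two Mandarin frames, two English frames, one Mandarin frame.
lids = ["M", "M", "E", "E", "M"]
labels = ["ni", "hao", "hello", "world", "ma"]

mandarin_view = monolingual_alignment(lids, labels, "M")
english_view = monolingual_alignment(lids, labels, "E")
```

In this toy setup the two modules produce complementary label sequences, which is why each one has to detect code-switch points in addition to transcribing.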

“…Finally, let us consider how to construct a neural architecture for our modified conditionally factorized framework. Monolingual and bilingual label-to-frame posteriors ( §2.1) may be modeled using CTC or RNN-T networks as demonstrated by prior works [29][30][31]. However for zero-shot CS ASR, the conditional independence assumption of CTC vs. the internal language modeling of RNN-T is a critical difference.…”

Section: Conditional CTC With External LM Architecture (mentioning)

confidence: 99%
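The difference between CTC and RNN-T mentioned in the statement above can be written out as follows. This is the standard textbook form of the two factorizations, not a formula quoted from the citing paper; the symbols $T$ (number of frames), $U$ (number of output labels), and $y_{<u_i}$ (the non-blank labels emitted before alignment step $i$) are notation assumed here for illustration.

```latex
% CTC: frame-level labels are conditionally independent given the input X
p_{\mathrm{CTC}}(Z \mid X) = \prod_{t=1}^{T} p(z_t \mid X)

% RNN-T: each alignment step also conditions on previously emitted non-blank labels,
% which acts as an internal language model
p_{\mathrm{RNN\text{-}T}}(Z \mid X) = \prod_{i=1}^{T+U} p(z_i \mid X, y_{<u_i})
```

Because the CTC factorization has no dependence on previous labels, all label-history information has to come from outside the network, for example from an external language model.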

“…A more promising direction towards zero-shot CS ASR can be found in prior works which seek to incorporate monolingual data directly to improve CS performance [14][15][16][17][18][19][20][21][22][23][24][25][26][27][28]. In particular, there are several works which achieve joint modeling of CS and monolingual ASR by conditionally factorizing the overall bilingual task into monolingual parts [29][30][31]. By using label-to-frame synchronization, this conditionally factorized approach can make a CS prediction given only the predictions of the monolingual parts [29]; theoretically these conditionally factorized models can model CS ASR without any CS data, but this has not been previously confirmed.…”

Section: Introduction (mentioning)

confidence: 99%

Towards Zero-Shot Code-Switched Speech Recognition

Yan, Wiesner, Klejch, et al. 2022

Preprint

In this work, we seek to build effective code-switched (CS) automatic speech recognition (ASR) systems under the zero-shot setting where no transcribed CS speech data is available for training. Previously proposed frameworks which conditionally factorize the bilingual task into its constituent monolingual parts are a promising starting point for leveraging monolingual data efficiently. However, these methods require the monolingual modules to perform language segmentation. That is, each monolingual module has to simultaneously detect CS points and transcribe speech segments of one language while ignoring those of other languages, which is not a trivial task. We propose to simplify each monolingual module by allowing it to transcribe all speech segments indiscriminately with a monolingual script (i.e. transliteration). This simple modification passes the responsibility of CS point detection to subsequent bilingual modules which determine the final output by considering multiple monolingual transliterations along with external language model information. We apply this transliteration-based approach in an end-to-end differentiable neural network and demonstrate its efficacy for zero-shot CS ASR on Mandarin-English SEAME test sets.
