Interspeech 2022
DOI: 10.21437/interspeech.2022-11249
Streaming End-to-End Multilingual Speech Recognition with Joint Language Identification

Abstract: Language identification is critical for many downstream tasks in automatic speech recognition (ASR), and it is beneficial to integrate it into multilingual end-to-end ASR as an additional task. In this paper, we propose to modify the structure of the cascaded-encoder-based recurrent neural network transducer (RNN-T) model by integrating a per-frame language identifier (LID) predictor. RNN-T with cascaded encoders can achieve streaming ASR with low latency using first-pass decoding with no right-context, and achieve l…
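The abstract describes a per-frame LID predictor attached to the encoder. As a rough illustration of what per-frame language decisions look like downstream, here is a minimal pure-Python sketch: the language inventory, the posteriors, and the majority-vote aggregation are all illustrative assumptions, not the paper's actual model.

```python
# Hypothetical sketch of per-frame language identification (LID):
# each encoder frame gets a posterior over languages, and an
# utterance-level label is derived by majority vote. All numbers
# below are toy values, not model outputs.

LANGS = ["en", "es", "fr"]  # assumed language inventory

def per_frame_lid(frame_posteriors):
    """Pick the most likely language index for each frame."""
    return [max(range(len(LANGS)), key=lambda i: p[i]) for p in frame_posteriors]

def utterance_lid(frame_posteriors):
    """Aggregate per-frame decisions into one utterance-level label
    by majority vote (one simple aggregation choice)."""
    votes = per_frame_lid(frame_posteriors)
    return LANGS[max(set(votes), key=votes.count)]

posteriors = [
    [0.7, 0.2, 0.1],  # frame 0: mostly English
    [0.6, 0.3, 0.1],  # frame 1
    [0.2, 0.7, 0.1],  # frame 2: a code-switched Spanish frame
]
print(utterance_lid(posteriors))  # -> "en"
```

Per-frame (rather than per-utterance) prediction is what allows a streaming system to react to code-switching mid-utterance.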

Cited by 14 publications (11 citation statements); references 55 publications.
“…(6) of [31]; we denote this merged CTC likelihood as P_CTC(Z|X). We then jointly decode P_CTC(·) with an external bilingual LM, P_BLM(Y), using the time-synchronous beam search described in [43], which approximates the decision given in their Eq. (12), where {V_M ∪ V_E}* denotes the set of all possible bilingual outputs. This architecture, which we refer to as Conditional CTC, is …” [Footnote: Unlike text-based transliteration [35], pseudo-labeling relies solely on the resources presumed to be available in our zero-shot CS ASR settings.]
Section: Conditional CTC with External LM Architecture
confidence: 99%
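The quoted passage describes jointly decoding a CTC acoustic score with an external language model. A minimal sketch of that kind of fused scoring, with a hypothetical LM weight `lam` and toy probabilities (not the cited paper's actual beam search), looks like this:

```python
import math

# Hedged sketch of shallow-fusion scoring: the acoustic (CTC) log-likelihood
# is combined with an external LM log-probability, weighted by `lam`.
# Hypothesis texts and probabilities are toy values for illustration.

def fused_score(log_p_ctc, log_p_lm, lam=0.3):
    """log P_CTC(Y|X) + lam * log P_LM(Y): the quantity maximized
    over candidate outputs during joint beam search."""
    return log_p_ctc + lam * log_p_lm

def best_hypothesis(hyps, lam=0.3):
    """Pick the hypothesis with the highest fused score."""
    return max(hyps, key=lambda h: fused_score(h["ctc"], h["lm"], lam))

hyps = [
    {"text": "hello world", "ctc": math.log(0.30), "lm": math.log(0.20)},
    {"text": "hola world",  "ctc": math.log(0.25), "lm": math.log(0.40)},
]
print(best_hypothesis(hyps)["text"])  # -> "hola world": the LM tips the choice
```

Here the acoustically weaker hypothesis wins because the bilingual LM strongly prefers it, which is exactly the effect an external LM contributes in code-switched decoding.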
“…Therefore, a preeminent challenge in the CS ASR field is to build effective systems under the zero-shot setting where no CS ASR training data is available. Recent advancements in multilingual speech recognition have demonstrated the impressive scale of cross-lingual sharing in neural network approaches [3][4][5][6][7][8][9][10][11][12], and these works have shown that jointly modeling ASR with language identity (LID) grants some intra-sentential CS ability [11][12][13]. However, most of these large scale models skew towards highresourced languages [9] and do not seek to directly optimize for intra-sentential CS ASR between particular language pairs.…”
Section: Introduction
confidence: 99%
“…A neural transducer model [4] has three components: an acoustic encoder, a label prediction network, and a joint network. Neural transducer models can use different types of encoders, such as LSTMs in RNN-T [4] and transformers [7,8,9,17,20,21,22] in the transformer transducer (T-T). In this study, we use T-T as the backbone model for development.…”
Section: Transformer Transducer Model
confidence: 99%
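The three transducer components named above can be sketched compactly: an encoder vector per frame, a prediction-network vector per emitted label, and a joint network combining the two into a label distribution. This is a toy pure-Python sketch with hand-picked vectors, not a trained model, and the joint network is reduced to an add-tanh-softmax for clarity.

```python
import math

# Minimal sketch of a transducer joint network: combine the encoder output
# at frame t with the prediction-network output after label u, squash with
# tanh, then normalize with softmax over output labels (including blank).
# All vectors are illustrative toy values.

def joint(enc_t, pred_u):
    """Joint network: add the two vectors, tanh, then softmax."""
    hidden = [math.tanh(e + p) for e, p in zip(enc_t, pred_u)]
    exps = [math.exp(h) for h in hidden]
    z = sum(exps)
    return [x / z for x in exps]

enc_t = [0.5, -0.2, 0.1]   # acoustic encoder output at frame t
pred_u = [0.1, 0.3, -0.4]  # prediction-network output after label u
probs = joint(enc_t, pred_u)
print(probs)  # a valid distribution over 3 labels
```

Real implementations insert learned projection matrices before and after the nonlinearity; the structure (per-frame acoustic state plus per-label language-model-like state feeding one joint) is what defines the transducer family, whether the encoder is an LSTM (RNN-T) or a transformer (T-T).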
“…While end-to-end (E2E) models have made rapid progress in automatic speech recognition (ASR) [1][2][3][4][5][6][7][8], there is strong demand for multilingual ASR models, since more than 60% of the world's population speaks more than two languages according to [10]. There have been plenty of efforts to develop E2E multilingual models [11][12][13][14][15][16][17][18][19][20][21][22][23][24][25][26], and these models can achieve comparable or even better ASR performance than monolingual baselines by passing language identification (LID) information, in the form of a one-hot or learnable embedding vector, to distinguish different languages. To build streaming multilingual ASR systems for practical applications that perform similarly to monolingual ones, we should not require users to input any LID information during model inference.…”
Section: Introduction
confidence: 99%
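The passage mentions conditioning a multilingual model on a one-hot LID vector. One common realization is simply appending that vector to every acoustic feature frame; the sketch below shows this under an assumed three-language inventory and illustrative feature values.

```python
# Sketch of conditioning a multilingual ASR model on language identity (LID)
# by appending a one-hot LID vector to each acoustic feature frame.
# The language list and feature values are illustrative assumptions.

LANGS = ["en", "zh", "de"]

def one_hot(lang):
    """One-hot LID vector over the assumed language inventory."""
    return [1.0 if l == lang else 0.0 for l in LANGS]

def condition_features(frames, lang):
    """Append the one-hot LID vector to every feature frame."""
    lid = one_hot(lang)
    return [f + lid for f in frames]

frames = [[0.1, 0.2], [0.3, 0.4]]
print(condition_features(frames, "zh"))  # each frame gains [0.0, 1.0, 0.0]
```

The point the quoted introduction makes is that a streaming system should not need this vector from the user at inference time, which is why joint LID prediction (as in the paper above) is attractive.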