Interspeech 2019
DOI: 10.21437/interspeech.2019-2858

Large-Scale Multilingual Speech Recognition with a Streaming End-to-End Model

Abstract: Multilingual end-to-end (E2E) models have shown great promise in expansion of automatic speech recognition (ASR) coverage of the world's languages. They have shown improvement over monolingual systems, and have simplified training and serving by eliminating language-specific acoustic, pronunciation, and language models. This work presents an E2E multilingual system which is equipped to operate in low-latency interactive applications, as well as handle a key challenge of real world data: the imbalance in traini…

Cited by 125 publications (65 citation statements)
References 33 publications
“…Under the context of ASR systems, multilingual systems [2,3,4] are developed to capture and utilize this information in common to facilitate ASR systems for low-resource languages. Recently, several large-scale systems have been introduced for multilingual ASR [5]. Pratap et al. [6] introduced a massive single E2E model with up to 1 billion parameters trained on 50 languages.…”
Section: Introduction (mentioning)
confidence: 99%
“…[11] introduced the adapter module for parameter-efficient domain adaptation in machine translation, where only a few parameters are introduced for each target domain. In [5,12], the authors used language-specific adapters to enhance the performance on each language for a multilingual ASR model.…”
Section: Introduction (mentioning)
confidence: 99%
“…With the exception of a few recent works [5,6,7], most previous work on multilingual speech recognition focuses on the benefits of these models for lower-resource or related languages. Nevertheless, in order for these models to be utilized in real-world scenarios and replace their monolingual counterparts, they need to target a variety of languages, with large … • Introduction of an informed mixture-of-experts layer, used in the encoder of an RNN-T model, where each expert is assigned to one language, or set of related languages.…”
Section: Introduction (mentioning)
confidence: 99%
“…A variety of approaches have been explored for changing the structure of the neural network model to make it more amenable to multilingual modeling. In the context of encoder-decoder models, [5] used adapter layers to account for different amounts of available data per language, [9] parameterized the attention heads of a Transformer-based encoder to be per-language, while [6] showed that a multi-decoder multilingual model, where each decoder is assigned to a cluster of languages, can achieve good performance. Since the introduction of Mixture of Experts (MOE) in [10], these models have found popularity in machine translation [8], and speech recognition [11,12].…”
Section: Introduction (mentioning)
confidence: 99%
“…RNN-T overcomes the conditional independence assumption of CTC with the prediction network; moreover, it allows streaming ASR because it still performs frame-level monotonic decoding. Hence, there has been a significant research effort in promoting this approach in the ASR community [22,21,25,26,27], and RNN-T has recently been successfully deployed in embedded devices [28].…”
Section: Introduction (mentioning)
confidence: 99%