Interspeech 2022
DOI: 10.21437/interspeech.2022-11249
Streaming End-to-End Multilingual Speech Recognition with Joint Language Identification

Abstract: Language identification is critical for many downstream tasks in automatic speech recognition (ASR), and it is beneficial to integrate it into multilingual end-to-end ASR as an additional task. In this paper, we propose to modify the structure of the cascaded-encoder-based recurrent neural network transducer (RNN-T) model by integrating a per-frame language identifier (LID) predictor. RNN-T with cascaded encoders can achieve streaming ASR with low latency using first-pass decoding with no right-context, and achieve l…
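The abstract describes a per-frame LID predictor attached to the encoder. As a rough illustration of what per-frame language decisions look like downstream, here is a minimal pure-Python sketch: the language inventory, the posteriors, and the majority-vote aggregation are all illustrative assumptions, not the paper's actual model.

```python
# Hypothetical sketch of per-frame language identification (LID):
# each encoder frame gets a posterior over languages, and an
# utterance-level label is derived by majority vote. All numbers
# below are toy values, not model outputs.

LANGS = ["en", "es", "fr"]  # assumed language inventory

def per_frame_lid(frame_posteriors):
    """Pick the most likely language index for each frame."""
    return [max(range(len(LANGS)), key=lambda i: p[i]) for p in frame_posteriors]

def utterance_lid(frame_posteriors):
    """Aggregate per-frame decisions into one utterance-level label
    by majority vote (one simple aggregation choice)."""
    votes = per_frame_lid(frame_posteriors)
    return LANGS[max(set(votes), key=votes.count)]

posteriors = [
    [0.7, 0.2, 0.1],  # frame 0: mostly English
    [0.6, 0.3, 0.1],  # frame 1
    [0.2, 0.7, 0.1],  # frame 2: a code-switched Spanish frame
]
print(utterance_lid(posteriors))  # -> "en"
```

Per-frame (rather than per-utterance) prediction is what allows a streaming system to react to code-switching mid-utterance.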

Cited by 14 publications (11 citation statements); references 55 publications.
“…(6) of [31]; we denote this merged CTC likelihood as P_CTC(Z|X). We then jointly decode P_CTC(·) with an external bilingual LM, P_BLM(Y), using the time-synchronous beam search described in [43], which approximates the decision given in their Eq. (12), where {V_M ∪ V_E}* denotes the set of all possible bilingual outputs. This architecture, which we refer to as Conditional CTC, is …” [Footnote: Unlike text-based transliteration [35], pseudo-labeling relies solely on the resources presumed to be available in our zero-shot CS ASR settings.]
Section: Conditional CTC with External LM Architecture
confidence: 99%
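The quoted passage describes jointly decoding a CTC acoustic score with an external language model. A minimal sketch of that kind of fused scoring, with a hypothetical LM weight `lam` and toy probabilities (not the cited paper's actual beam search), looks like this:

```python
import math

# Hedged sketch of shallow-fusion scoring: the acoustic (CTC) log-likelihood
# is combined with an external LM log-probability, weighted by `lam`.
# Hypothesis texts and probabilities are toy values for illustration.

def fused_score(log_p_ctc, log_p_lm, lam=0.3):
    """log P_CTC(Y|X) + lam * log P_LM(Y): the quantity maximized
    over candidate outputs during joint beam search."""
    return log_p_ctc + lam * log_p_lm

def best_hypothesis(hyps, lam=0.3):
    """Pick the hypothesis with the highest fused score."""
    return max(hyps, key=lambda h: fused_score(h["ctc"], h["lm"], lam))

hyps = [
    {"text": "hello world", "ctc": math.log(0.30), "lm": math.log(0.20)},
    {"text": "hola world",  "ctc": math.log(0.25), "lm": math.log(0.40)},
]
print(best_hypothesis(hyps)["text"])  # -> "hola world": the LM tips the choice
```

Here the acoustically weaker hypothesis wins because the bilingual LM strongly prefers it, which is exactly the effect an external LM contributes in code-switched decoding.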
“…Therefore, a preeminent challenge in the CS ASR field is to build effective systems under the zero-shot setting where no CS ASR training data is available. Recent advancements in multilingual speech recognition have demonstrated the impressive scale of cross-lingual sharing in neural network approaches [3][4][5][6][7][8][9][10][11][12], and these works have shown that jointly modeling ASR with language identity (LID) grants some intra-sentential CS ability [11][12][13]. However, most of these large scale models skew towards highresourced languages [9] and do not seek to directly optimize for intra-sentential CS ASR between particular language pairs.…”
Section: Introduction
confidence: 99%
“…A neural transducer model [4] has three components: an acoustic encoder, a label prediction network, and a joint network. Neural transducer models can use different types of encoders, such as LSTMs in RNN-T [4] and transformers [7,8,9,17,20,21,22] in the transformer transducer (T-T). In this study, we use T-T as the backbone model for development.…”
Section: Transformer Transducer Model
confidence: 99%
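The three transducer components named above can be sketched compactly: an encoder vector per frame, a prediction-network vector per emitted label, and a joint network combining the two into a label distribution. This is a toy pure-Python sketch with hand-picked vectors, not a trained model, and the joint network is reduced to an add-tanh-softmax for clarity.

```python
import math

# Minimal sketch of a transducer joint network: combine the encoder output
# at frame t with the prediction-network output after label u, squash with
# tanh, then normalize with softmax over output labels (including blank).
# All vectors are illustrative toy values.

def joint(enc_t, pred_u):
    """Joint network: add the two vectors, tanh, then softmax."""
    hidden = [math.tanh(e + p) for e, p in zip(enc_t, pred_u)]
    exps = [math.exp(h) for h in hidden]
    z = sum(exps)
    return [x / z for x in exps]

enc_t = [0.5, -0.2, 0.1]   # acoustic encoder output at frame t
pred_u = [0.1, 0.3, -0.4]  # prediction-network output after label u
probs = joint(enc_t, pred_u)
print(probs)  # a valid distribution over 3 labels
```

Real implementations insert learned projection matrices before and after the nonlinearity; the structure (per-frame acoustic state plus per-label language-model-like state feeding one joint) is what defines the transducer family, whether the encoder is an LSTM (RNN-T) or a transformer (T-T).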
“…While end-to-end (E2E) models have made rapid progress in automatic speech recognition (ASR) [1][2][3][4][5][6][7][8], there is strong demand for multilingual ASR models, since more than 60% of the world's population speaks more than two languages according to [10]. There have been plenty of efforts to develop E2E multilingual models [11][12][13][14][15][16][17][18][19][20][21][22][23][24][25][26], and these models can achieve comparable or even better ASR performance than monolingual baselines by passing language identification (LID) information, in the form of a one-hot or learnable embedding vector, to distinguish different languages. To build streaming multilingual ASR systems for practical applications that perform similarly to monolingual ones, we should not require users to input any LID information during model inference.…”
Section: Introduction
confidence: 99%
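The passage mentions conditioning a multilingual model on a one-hot LID vector. One common realization is simply appending that vector to every acoustic feature frame; the sketch below shows this under an assumed three-language inventory and illustrative feature values.

```python
# Sketch of conditioning a multilingual ASR model on language identity (LID)
# by appending a one-hot LID vector to each acoustic feature frame.
# The language list and feature values are illustrative assumptions.

LANGS = ["en", "zh", "de"]

def one_hot(lang):
    """One-hot LID vector over the assumed language inventory."""
    return [1.0 if l == lang else 0.0 for l in LANGS]

def condition_features(frames, lang):
    """Append the one-hot LID vector to every feature frame."""
    lid = one_hot(lang)
    return [f + lid for f in frames]

frames = [[0.1, 0.2], [0.3, 0.4]]
print(condition_features(frames, "zh"))  # each frame gains [0.0, 1.0, 0.0]
```

The point the quoted introduction makes is that a streaming system should not need this vector from the user at inference time, which is why joint LID prediction (as in the paper above) is attractive.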