Interspeech 2020
DOI: 10.21437/interspeech.2020-1855

A New Training Pipeline for an Improved Neural Transducer

Abstract: The RNN transducer is a promising end-to-end model candidate. We compare the original training criterion with the full marginalization over all alignments, to the commonly used maximum approximation, which simplifies, improves and speeds up our training. We also generalize from the original neural network model and study more powerful models, made possible due to the maximum approximation. We further generalize the output label topology to cover RNN-T, RNA and CTC. We perform several studies among all these as…

Cited by 42 publications (36 citation statements)
References 53 publications (70 reference statements)
“…The architecture of the trained models is as follows. The encoder contains 6 bidirectional LSTM layers with 640 cells per layer per direction and is initialized with a network trained with CTC based on [29] similar to [5,6,8,15]. The prediction network is a single unidirectional LSTM layer with only 768 cells (this size has been found to be optimal after external LM fusion).…”
Section: Experiments on Switchboard 300 Hours
confidence: 99%
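
As a rough illustration of the quoted setup, here is a PyTorch-style sketch of an encoder and prediction network with the stated layer sizes; the module names, input feature dimension, and vocabulary size are assumptions for illustration, not the authors' code.

    import torch.nn as nn

    class Encoder(nn.Module):
        """6 bidirectional LSTM layers, 640 cells per layer per direction."""
        def __init__(self, input_dim=80, hidden=640, layers=6):
            super().__init__()
            self.lstm = nn.LSTM(input_dim, hidden, num_layers=layers,
                                bidirectional=True, batch_first=True)

        def forward(self, x):              # x: (batch, time, input_dim)
            out, _ = self.lstm(x)          # out: (batch, time, 2 * 640)
            return out

    class PredictionNetwork(nn.Module):
        """Single unidirectional LSTM layer with 768 cells."""
        def __init__(self, vocab_size=1000, hidden=768):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, hidden)
            self.lstm = nn.LSTM(hidden, hidden, num_layers=1, batch_first=True)

        def forward(self, y_prev):         # y_prev: (batch, label_len) previous labels
            out, _ = self.lstm(self.embed(y_prev))
            return out

In the quoted setup the encoder is additionally initialized from a CTC-trained network; that initialization step is not shown here.
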
“…This led to a rapidly evolving research landscape in end-to-end modeling for ASR with Recurrent Neural Network Transducers (RNN-T) [1] and attention-based models [2,3] being the most prominent examples. Attention based models are excellent at handling non-monotonic alignment problems such as translation [4], whereas RNN-Ts are an ideal match for the left-to-right nature of speech [5][6][7][8][9][10][11][12][13][14][15][16][17].…”
Section: Introduction
confidence: 99%
“…where q is the set of alignments which can be mapped to Y, a subset of all the alignments from CTC output space. To reduce computational cost, the maximum approximation [21] is applied:…”
Section: Training
confidence: 99%
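
The equation elided above is presumably the standard maximum (Viterbi) approximation, which replaces the sum over alignments by its largest term. A minimal sketch under that assumption, with Y and q as in the quoted statement and \pi ranging over alignments:

    full-sum criterion:       P(Y | X) = \sum_{\pi \in q} P(\pi | X)
    maximum approximation:    P(Y | X) \approx \max_{\pi \in q} P(\pi | X)

Keeping only the single best alignment is what the quoted statement refers to as reducing the computational cost of training.
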
“…With the introduction of self-attention based transformer models [21] and data augmentation techniques like SpecAugment [16], transducers have also seen competitive performance to the attention-based encoder-decoder models [19]. With transducer models, CTC loss has been primarily studied as a pre-training objective [18,26]. This process requires a fine-tuning phase which can be cumbersome due to the decisions involved in neural-network optimization like learning rate, optimizer, etc.…”
Section: Relation to Prior Work
confidence: 99%
“…Unlike CTC, where each output label is conditionally independent of the others given the input speech, neural transducers condition the output on all the previous labels. Unlike attentionbased encoder-decoder models, transducer models learn explicit input-output alignments, making it robust towards long utterances [18]. In particular, we focus on adapting the transformer-transducer (T-T) model to code-switched ASR [19,20].…”
Section: Introduction
confidence: 99%
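
A minimal sketch of the two factorizations being contrasted here (notation assumed for illustration, not taken from the cited papers): for input X, an alignment label sequence a_1, ..., a_N over labels and blanks, and previously emitted non-blank labels y_{<i},

    CTC:        P(a | X) = \prod_i P(a_i | X)            (outputs conditionally independent given X)
    Transducer: P(a | X) = \prod_i P(a_i | y_{<i}, X)    (each output conditioned on the label history)

The transducer's per-step dependence on the emitted label history is carried by the prediction network, while the alignment variable a keeps the input-output correspondence explicit.
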