2021
DOI: 10.48550/arxiv.2103.09935
Preprint

Advancing RNN Transducer Technology for Speech Recognition

Abstract: We investigate a set of techniques for RNN Transducers (RNN-Ts) that were instrumental in lowering the word error rate on three different tasks (Switchboard 300 hours, conversational Spanish 780 hours and conversational Italian 900 hours). The techniques pertain to architectural changes, speaker adaptation, language model fusion, model combination and general training recipe. First, we introduce a novel multiplicative integration of the encoder and prediction network vectors in the joint network (as opposed to…
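The multiplicative integration mentioned in the abstract can be illustrated with a short sketch. The following is a minimal PyTorch rendering under assumed dimensions: the encoder width, joint width, tanh nonlinearity, and projection layers are illustrative choices, not taken from the paper; only the elementwise product of the projected encoder and prediction vectors reflects the technique itself.

```python
import torch
import torch.nn as nn

class MultiplicativeJoint(nn.Module):
    """Joint network that combines encoder and prediction network outputs
    by elementwise multiplication rather than the usual addition."""

    def __init__(self, enc_dim=1024, pred_dim=768, joint_dim=256, vocab=46):
        super().__init__()
        self.enc_proj = nn.Linear(enc_dim, joint_dim)
        self.pred_proj = nn.Linear(pred_dim, joint_dim)
        self.out = nn.Linear(joint_dim, vocab)

    def forward(self, enc, pred):
        # enc: (B, T, enc_dim) acoustic frames; pred: (B, U, pred_dim) label states
        e = self.enc_proj(enc).unsqueeze(2)    # (B, T, 1, J)
        p = self.pred_proj(pred).unsqueeze(1)  # (B, 1, U, J)
        h = torch.tanh(e * p)                  # multiplicative integration
        return torch.log_softmax(self.out(h), dim=-1)  # (B, T, U, vocab)

# Quick shape check over the (T, U) grid.
joint = MultiplicativeJoint()
logp = joint(torch.randn(2, 5, 1024), torch.randn(2, 3, 768))
print(logp.shape)  # torch.Size([2, 5, 3, 46])
```

The broadcast over the (T, U) grid is the standard RNN-T joint layout; swapping the product for a sum would recover the conventional additive joint network.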

Cited by 7 publications (14 citation statements)
References 28 publications

Citation statements:

“…The prediction network is a single unidirectional LSTM layer with 768 cells. Encoder and prediction network outputs are combined multiplicatively in a joint network [3], with an FC layer and log-Softmax over 46 output characters. We train for 20 epochs with batch size 64, using AdamW and a triangular LR policy (OneCycleLR), on the audio and character-level transcripts from the SWB corpus, augmented with speed and tempo perturbation [21], SpecAugment [22], and Sequence Noise Injection [23].…”
Section: Speech Models
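As a rough sketch of the optimizer and schedule named in the statement above (AdamW with a triangular OneCycleLR policy), the snippet below wires them together in PyTorch. The stand-in model, feature size, learning rate, weight decay, and loss are placeholders, not values from the paper; a real setup would use the RNN-T loss and the SWB data pipeline.

```python
import torch
from torch import nn
from torch.optim import AdamW
from torch.optim.lr_scheduler import OneCycleLR

# Placeholder model standing in for the full RNN-T.
model = nn.LSTM(input_size=240, hidden_size=768, batch_first=True)

epochs, steps_per_epoch = 20, 100   # steps_per_epoch would be len(train_loader)
optimizer = AdamW(model.parameters(), lr=5e-4, weight_decay=1e-2)
# anneal_strategy="linear" gives the triangular ramp-up/ramp-down shape.
scheduler = OneCycleLR(optimizer, max_lr=5e-4, epochs=epochs,
                       steps_per_epoch=steps_per_epoch, anneal_strategy="linear")

for epoch in range(epochs):
    for step in range(steps_per_epoch):
        x = torch.randn(64, 50, 240)        # batch of 64 dummy feature sequences
        out, _ = model(x)
        loss = out.pow(2).mean()            # stand-in for the RNN-T loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        scheduler.step()                    # one scheduler step per batch
```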
“…In the acoustic encoder, all LSTM layers besides the first are quantized to 4 bits; this dramatically increases their computation throughput and reduces encoder runtime by 2.6× (blue bars). Because the beam search decoding process is iterative [3], decoder runtime grows significantly with beam width. Thanks to the quantized prediction network, decoding time (red bars) scales well between FP16 and INT4, achieving a 3.3× speed-up and mitigating the impact of wider beams.…”
Section: Inference Performance in End-to-End Models
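To make the 4-bit weight idea concrete, here is a small fake-quantization sketch that snaps the weights of every LSTM layer except the first to a symmetric 4-bit grid. This only simulates the numerics; the runtime gains quoted above require actual INT4 kernels, and the layer count and sizes here are assumptions.

```python
import torch

def fake_quantize_int4(w: torch.Tensor) -> torch.Tensor:
    """Symmetric per-tensor fake quantization to 4 bits (integer levels -8..7)."""
    scale = w.abs().max() / 7.0
    return torch.clamp(torch.round(w / scale), -8, 7) * scale

# Illustrative 6-layer encoder; sizes are assumptions, not from the paper.
lstm = torch.nn.LSTM(input_size=768, hidden_size=768, num_layers=6,
                     batch_first=True)

with torch.no_grad():
    # Quantize all LSTM layers except the first, as in the statement above.
    for layer in range(1, lstm.num_layers):
        for name in (f"weight_ih_l{layer}", f"weight_hh_l{layer}"):
            w = getattr(lstm, name)
            w.copy_(fake_quantize_int4(w))
```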