2021
DOI: 10.48550/arxiv.2103.09935
Preprint

Advancing RNN Transducer Technology for Speech Recognition

Abstract: We investigate a set of techniques for RNN Transducers (RNN-Ts) that were instrumental in lowering the word error rate on three different tasks (Switchboard 300 hours, conversational Spanish 780 hours and conversational Italian 900 hours). The techniques pertain to architectural changes, speaker adaptation, language model fusion, model combination and general training recipe. First, we introduce a novel multiplicative integration of the encoder and prediction network vectors in the joint network (as opposed to…
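The multiplicative integration mentioned in the abstract can be illustrated with a short sketch. The following is a minimal PyTorch rendering under assumed dimensions: the encoder width, joint width, tanh nonlinearity, and projection layers are illustrative choices, not taken from the paper; only the elementwise product of the projected encoder and prediction vectors reflects the technique itself.

```python
import torch
import torch.nn as nn

class MultiplicativeJoint(nn.Module):
    """Joint network that combines encoder and prediction network outputs
    by elementwise multiplication rather than the usual addition."""

    def __init__(self, enc_dim=1024, pred_dim=768, joint_dim=256, vocab=46):
        super().__init__()
        self.enc_proj = nn.Linear(enc_dim, joint_dim)
        self.pred_proj = nn.Linear(pred_dim, joint_dim)
        self.out = nn.Linear(joint_dim, vocab)

    def forward(self, enc, pred):
        # enc: (B, T, enc_dim) acoustic frames; pred: (B, U, pred_dim) label states
        e = self.enc_proj(enc).unsqueeze(2)    # (B, T, 1, J)
        p = self.pred_proj(pred).unsqueeze(1)  # (B, 1, U, J)
        h = torch.tanh(e * p)                  # multiplicative integration
        return torch.log_softmax(self.out(h), dim=-1)  # (B, T, U, vocab)

# Quick shape check over the (T, U) grid.
joint = MultiplicativeJoint()
logp = joint(torch.randn(2, 5, 1024), torch.randn(2, 3, 768))
print(logp.shape)  # torch.Size([2, 5, 3, 46])
```

The broadcast over the (T, U) grid is the standard RNN-T joint layout; swapping the product for a sum would recover the conventional additive joint network.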

Cited by 7 publications (14 citation statements)
References 28 publications

Citation statements:

“…The prediction network is a single unidirectional LSTM layer with 768 cells. Encoder and prediction network outputs are combined multiplicatively in a joint network [3], with an FC layer and log-Softmax over 46 output characters. We train for 20 epochs with batch size 64, using AdamW and a triangular LR policy (OneCycleLR), on the audio and character-level transcripts from the SWB corpus, augmented with speed and tempo perturbation [21], SpecAugment [22], and Sequence Noise Injection [23].…”
Section: Speech Models
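As a rough sketch of the optimizer and schedule named in the statement above (AdamW with a triangular OneCycleLR policy), the snippet below wires them together in PyTorch. The stand-in model, feature size, learning rate, weight decay, and loss are placeholders, not values from the paper; a real setup would use the RNN-T loss and the SWB data pipeline.

```python
import torch
from torch import nn
from torch.optim import AdamW
from torch.optim.lr_scheduler import OneCycleLR

# Placeholder model standing in for the full RNN-T.
model = nn.LSTM(input_size=240, hidden_size=768, batch_first=True)

epochs, steps_per_epoch = 20, 100   # steps_per_epoch would be len(train_loader)
optimizer = AdamW(model.parameters(), lr=5e-4, weight_decay=1e-2)
# anneal_strategy="linear" gives the triangular ramp-up/ramp-down shape.
scheduler = OneCycleLR(optimizer, max_lr=5e-4, epochs=epochs,
                       steps_per_epoch=steps_per_epoch, anneal_strategy="linear")

for epoch in range(epochs):
    for step in range(steps_per_epoch):
        x = torch.randn(64, 50, 240)        # batch of 64 dummy feature sequences
        out, _ = model(x)
        loss = out.pow(2).mean()            # stand-in for the RNN-T loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        scheduler.step()                    # one scheduler step per batch
```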
“…In the acoustic encoder, all LSTM layers besides the first are quantized to 4 bits; this dramatically increases their computation throughput and reduces encoder runtime by 2.6× (blue bars). Because the beam search decoding process is iterative [3], decoder runtime grows significantly with beam width. Thanks to the quantized prediction network, decoding time (red bars) scales well between FP16 and INT4, achieving a 3.3× speed-up and mitigating the impact of wider beams.…”
Section: Inference Performance in End-to-End Models
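To make the 4-bit weight idea concrete, here is a small fake-quantization sketch that snaps the weights of every LSTM layer except the first to a symmetric 4-bit grid. This only simulates the numerics; the runtime gains quoted above require actual INT4 kernels, and the layer count and sizes here are assumptions.

```python
import torch

def fake_quantize_int4(w: torch.Tensor) -> torch.Tensor:
    """Symmetric per-tensor fake quantization to 4 bits (integer levels -8..7)."""
    scale = w.abs().max() / 7.0
    return torch.clamp(torch.round(w / scale), -8, 7) * scale

# Illustrative 6-layer encoder; sizes are assumptions, not from the paper.
lstm = torch.nn.LSTM(input_size=768, hidden_size=768, num_layers=6,
                     batch_first=True)

with torch.no_grad():
    # Quantize all LSTM layers except the first, as in the statement above.
    for layer in range(1, lstm.num_layers):
        for name in (f"weight_ih_l{layer}", f"weight_hh_l{layer}"):
            w = getattr(lstm, name)
            w.copy_(fake_quantize_int4(w))
```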