2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)
DOI: 10.1109/asru46091.2019.9003822
Monotonic Recurrent Neural Network Transducer and Decoding Strategies


Cited by 46 publications (42 citation statements)
References 4 publications
“…Decoding: All results are reported after decoding models using the breadth-first search decoding algorithm [15]. The limited context models are always evaluated using the path-merging process proposed in Section 3.1.…”
Section: Methods (mentioning, confidence: 99%)
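
The breadth-first search these citations refer to advances frame by frame, expanding every hypothesis on the beam once per frame before pruning back to the beam width. The sketch below only illustrates that idea under a monotonic (at most one label per frame) assumption; the joint, predict, and init_state hooks and the blank index are hypothetical placeholders, not the interface used in the paper.

```python
import numpy as np

BLANK = 0  # assumed blank index

def breadth_first_decode(enc, joint, predict, init_state, beam=4):
    """Illustrative frame-synchronous, breadth-first RNN-T beam search.

    enc:      (T, D) array of encoder frames.
    joint:    hypothetical hook mapping (encoder frame, prediction state)
              to a (V,) vector of log-probabilities including blank.
    predict:  hypothetical hook advancing the prediction-network state by
              one emitted label.
    Each hypothesis is (labels, prediction state, log score).
    """
    hyps = [((), init_state, 0.0)]
    for t in range(enc.shape[0]):
        expanded = []
        for labels, state, score in hyps:
            log_probs = joint(enc[t], state)
            # blank: keep the label sequence, only advance in time
            expanded.append((labels, state, score + float(log_probs[BLANK])))
            # non-blank: emit at most one label on this frame (monotonic case)
            for k in np.argsort(log_probs)[::-1][:beam]:
                if k == BLANK:
                    continue
                expanded.append((labels + (int(k),), predict(state, int(k)),
                                 score + float(log_probs[k])))
        # breadth-first step for frame t done: prune back to the beam width
        expanded.sort(key=lambda h: h[2], reverse=True)
        hyps = expanded[:beam]
    return max(hyps, key=lambda h: h[2])[0]
```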
“…Traditional decoding algorithms for RNN-T [1,15] only produce trees that are rooted at the sos label since distinct label sequences result in unique model states (i.e., the state of the prediction network, since the encoder state is not conditioned on the label sequence). In a limited context model, however, model states are identical if two paths on the beam share the same local label history.…”
Section: Decoding With Path Merging To Create Lattices (mentioning, confidence: 99%)
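
The path-merging idea in this excerpt can be sketched directly: when the prediction network conditions only on the last k labels, two hypotheses whose last k labels agree share the same model state and can be collapsed into a single beam entry, which turns the search tree into a lattice. The (labels, score) hypothesis representation and the log-sum-exp score combination below are assumptions made for illustration; state bookkeeping is omitted.

```python
import numpy as np

def merge_paths(hyps, context_size):
    """Collapse hypotheses that share the same local label history.

    hyps:          list of (labels, log score) pairs, labels being a tuple.
    context_size:  number of trailing labels the limited context model
                   actually conditions on.
    """
    merged = {}
    for labels, score in hyps:
        key = labels[-context_size:]  # local label history decides the model state
        if key in merged:
            kept_labels, kept_score = merged[key]
            # the merged entry accumulates the probability mass of both paths
            merged[key] = (kept_labels, float(np.logaddexp(kept_score, score)))
        else:
            merged[key] = (labels, score)
    return list(merged.values())

# e.g. with context_size=2 these two paths share the history ('e', 'l') and merge:
# merge_paths([(('h', 'e', 'l'), -1.2), (('x', 'e', 'l'), -2.5)], context_size=2)
```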
“…We break the transducer lattice rule a little bit in decoding. One frame only outputs one phone label or blank [25].…”
Section: Phone Synchronous Decoding With Blank Skipping (mentioning, confidence: 99%)
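
The blank-skipping idea quoted here can be sketched as a pre-filter over frames: frames whose blank posterior is above a threshold are skipped, so label expansion only happens on the few frames likely to emit a phone. The joint hook, the single fixed prediction state, and the 0.95 threshold below are illustrative assumptions rather than details taken from the cited paper.

```python
import numpy as np

BLANK = 0  # assumed blank index

def phone_emitting_frames(enc, joint, state, blank_threshold=0.95):
    """Return indices of frames that are worth expanding with phone labels.

    enc:    (T, D) array of encoder frames.
    joint:  hypothetical hook mapping (encoder frame, prediction state) to
            a (V,) vector of log-probabilities including blank.
    state:  a single prediction-network state; using one fixed state for the
            whole utterance is a simplification made for this sketch.
    """
    keep = []
    for t in range(enc.shape[0]):
        log_probs = joint(enc[t], state)
        if np.exp(log_probs[BLANK]) < blank_threshold:
            keep.append(t)  # likely to emit a phone, do not skip
    return keep
```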
“…This led to a rapidly evolving research landscape in end-to-end modeling for ASR with Recurrent Neural Network Transducers (RNN-T) [1] and attention-based models [2,3] being the most prominent examples. Attention-based models are excellent at handling non-monotonic alignment problems such as translation [4], whereas RNN-Ts are an ideal match for the left-to-right nature of speech [5-17].…”
Section: Introduction (mentioning, confidence: 99%)