ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp40776.2020.9054419

Rnn-Transducer with Stateless Prediction Network

Cited by 71 publications (47 citation statements) | References 10 publications
“…Using 5-gram contexts provides the best balance between the two evaluation metrics, allowing the model to improve oracle WERs by 14.3-36.4% across the various test sets, with larger improvements on the longer NVS/NVS-hard sets. We also note that our results are in contrast to previous studies, which were conducted on smaller-scale tasks with limited training data [6,16].…”
Section: Results (contrasting, confidence: 99%)
“…The first part of this question has been investigated in a few recent papers, in different contexts. Ghodsi et al. [16] find that in low-resource settings, where training data is limited, the use of word-piece units [17] allows for a stateless prediction network (i.e., one which conditions on only one previous label) without a significant loss in accuracy. Zhang et al. [6] investigate the impact of varying label context in the transformer-transducer model (RNN-T with the LSTMs replaced by transformer networks [18]), finding that a context of 3-4 previous graphemes achieves performance similar to a full-context baseline on the Librispeech dataset.…”
Section: Introduction (mentioning, confidence: 99%)
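To make the stateless idea above concrete, here is a minimal PyTorch sketch contrasting a conventional LSTM prediction network with a stateless one that conditions only on the previous label. Module names and dimensions are illustrative assumptions, not taken from the cited papers.

```python
import torch
import torch.nn as nn

class LSTMPredictor(nn.Module):
    """Conventional RNN-T prediction network: conditions on the full
    label history through recurrent state."""
    def __init__(self, vocab_size, dim):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.lstm = nn.LSTM(dim, dim, batch_first=True)

    def forward(self, labels, state=None):
        out, state = self.lstm(self.embed(labels), state)
        return out, state

class StatelessPredictor(nn.Module):
    """Stateless variant: conditions only on the single previous label,
    so it reduces to an embedding lookup with no state to carry."""
    def __init__(self, vocab_size, dim):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)

    def forward(self, labels, state=None):
        return self.embed(labels), None

# labels: (batch, U) previous non-blank label ids
labels = torch.randint(0, 500, (2, 7))
out, _ = StatelessPredictor(500, 256)(labels)  # out: (2, 7, 256)
```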
“…Then, to alleviate the deletion errors caused by over-scored blank predictions, we propose a blank label deweighting approach during speech transducer decoding, which reduces deletion errors significantly in our experiments. To reduce model parameters and computation, a deep feedforward sequential memory network (DFSMN) block is used to replace the RNN encoder, and a causal 1-D CNN-based (Conv1d) stateless predictor [17,18] is adopted. Finally, we apply singular value decomposition (SVD) to our speech transducer to further compress the model.…”
Section: Introduction (mentioning, confidence: 99%)
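The sketch below illustrates two of the ingredients named in this excerpt, under assumed shapes and names that are hypothetical rather than from the cited paper: a causal Conv1d predictor over a fixed window of previous labels, and blank deweighting applied to decoder log-probabilities.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Conv1dStatelessPredictor(nn.Module):
    """Predictor that sees only a fixed window of previous labels,
    via a causal 1-D convolution over their embeddings."""
    def __init__(self, vocab_size, dim, context=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.context = context
        self.conv = nn.Conv1d(dim, dim, kernel_size=context)

    def forward(self, labels):                  # labels: (batch, U)
        x = self.embed(labels).transpose(1, 2)  # (batch, dim, U)
        x = F.pad(x, (self.context - 1, 0))     # left-pad only: causal
        return self.conv(x).transpose(1, 2)     # (batch, U, dim)

def deweight_blank(log_probs, blank_id, penalty):
    """Subtract a fixed penalty from the blank log-probability before
    choosing the next symbol, discouraging over-scored blanks."""
    log_probs = log_probs.clone()
    log_probs[..., blank_id] -= penalty
    return log_probs
```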
“…In recent years, keyword spotting has gained substantial performance improvements from deep learning algorithms [1,2,3,4,5]. More recently, end-to-end trained models have been successfully applied to automatic speech recognition (ASR) [6,7,8,9,10] and KWS [11,12,13,14]. In "online" (streaming) scenarios, where the speech signal must be processed in real time, RNN-T has been one of the most popular end-to-end modeling methods and has shown promising results in KWS [11,14,15].…”
Section: Introduction (mentioning, confidence: 99%)
“…As a result, all possible prediction network outputs can be pre-computed and saved for decoding, which saves storage and computational cost. The stateless prediction network structure has already been investigated in ASR and shows only slight performance degradation [10]. Third, to further reduce over-fitting and improve the model's generalization capacity, we investigate transfer learning, where an RNN-T model trained with nearly 160 thousand hours of speech recognition data is used to initialize the KWS model.…”
Section: Introduction (mentioning, confidence: 99%)
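Because a stateless predictor's output depends only on the previous label id, it can be tabulated once over the whole vocabulary, exactly as this excerpt describes. The sketch below uses a toy embedding predictor with assumed sizes, not the cited system.

```python
import torch
import torch.nn as nn

vocab_size, dim = 500, 256

# Any predictor that depends only on the previous label id can be
# tabulated: run it once over the whole vocabulary before decoding.
predictor = nn.Embedding(vocab_size, dim).eval()

with torch.no_grad():
    table = predictor(torch.arange(vocab_size))  # (vocab_size, dim)

# At decode time the prediction network is a single table lookup:
prev_label = torch.tensor([42])
pred_out = table[prev_label]                     # (1, dim)
```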