ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp40776.2020.9054419

Rnn-Transducer with Stateless Prediction Network

Cited by 71 publications (47 citation statements) | References 10 publications
“…Using 5-gram contexts provides the best balance between the two evaluation metrics, allowing the model to improve oracle WERs by 14.3-36.4% across the various test sets, with larger improvements on the longer NVS/NVS-hard sets. We also note that our results are in contrast to previous studies, which were conducted on smaller-scale tasks with limited training data [6,16].…”
Section: Results (contrasting, confidence: 99%)
“…The first part of this question has been investigated in a few recent papers, in different contexts. Ghodsi et al. [16] find that in low-resource settings, where training data is limited, the use of word-piece units [17] allows for a stateless prediction network (i.e., one which conditions on only one previous label) without a significant loss in accuracy. Zhang et al. [6] investigate the impact of varying label context in the transformer-transducer model (RNN-T with the LSTMs replaced by transformer networks [18]), finding that a context of 3-4 previous graphemes achieves performance similar to a full-context baseline on the Librispeech dataset.…”
Section: Introduction (mentioning, confidence: 99%)
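To make the stateless idea above concrete, here is a minimal PyTorch sketch contrasting a conventional LSTM prediction network with a stateless one that conditions only on the previous label. Module names and dimensions are illustrative assumptions, not taken from the cited papers.

```python
import torch
import torch.nn as nn

class LSTMPredictor(nn.Module):
    """Conventional RNN-T prediction network: conditions on the full
    label history through recurrent state."""
    def __init__(self, vocab_size, dim):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.lstm = nn.LSTM(dim, dim, batch_first=True)

    def forward(self, labels, state=None):
        out, state = self.lstm(self.embed(labels), state)
        return out, state

class StatelessPredictor(nn.Module):
    """Stateless variant: conditions only on the single previous label,
    so it reduces to an embedding lookup with no state to carry."""
    def __init__(self, vocab_size, dim):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)

    def forward(self, labels, state=None):
        return self.embed(labels), None

# labels: (batch, U) previous non-blank label ids
labels = torch.randint(0, 500, (2, 7))
out, _ = StatelessPredictor(500, 256)(labels)  # out: (2, 7, 256)
```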
“…Then, to alleviate the deletion errors caused by over-scored blank predictions, we propose a blank label deweighting approach during speech transducer decoding, which reduces deletion errors significantly in our experiments. To reduce model parameters and computation, a deep feedforward sequential memory network (DFSMN) block is used to replace the RNN encoder, and a causal 1-D CNN-based (Conv1d) stateless predictor [17,18] is adopted. Finally, we apply singular value decomposition (SVD) to our speech transducer to further compress the model.…”
Section: Introduction (mentioning, confidence: 99%)
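The sketch below illustrates two of the ingredients named in this excerpt, under assumed shapes and names that are hypothetical rather than from the cited paper: a causal Conv1d predictor over a fixed window of previous labels, and blank deweighting applied to decoder log-probabilities.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Conv1dStatelessPredictor(nn.Module):
    """Predictor that sees only a fixed window of previous labels,
    via a causal 1-D convolution over their embeddings."""
    def __init__(self, vocab_size, dim, context=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.context = context
        self.conv = nn.Conv1d(dim, dim, kernel_size=context)

    def forward(self, labels):                  # labels: (batch, U)
        x = self.embed(labels).transpose(1, 2)  # (batch, dim, U)
        x = F.pad(x, (self.context - 1, 0))     # left-pad only: causal
        return self.conv(x).transpose(1, 2)     # (batch, U, dim)

def deweight_blank(log_probs, blank_id, penalty):
    """Subtract a fixed penalty from the blank log-probability before
    choosing the next symbol, discouraging over-scored blanks."""
    log_probs = log_probs.clone()
    log_probs[..., blank_id] -= penalty
    return log_probs
```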
“…In recent years, keyword spotting has gained substantial performance improvements from deep learning algorithms [1,2,3,4,5]. More recently, end-to-end trained models have been successfully applied to automatic speech recognition (ASR) [6,7,8,9,10] and KWS [11,12,13,14]. In "online" (streaming) scenarios, where the speech signal must be processed in real time, RNN-T has been one of the most popular end-to-end modeling methods and has shown promising results in KWS [11,14,15].…”
Section: Introduction (mentioning, confidence: 99%)
“…As a result, all possible prediction network outputs can be pre-computed and saved for decoding, which saves storage and computational cost. The stateless prediction network structure has already been investigated in ASR and shows only slight performance degradation [10]. Third, to further reduce over-fitting and improve the model's generalization capacity, we investigate transfer learning, where an RNN-T model trained with nearly 160 thousand hours of speech recognition data is used to initialize the KWS model.…”
Section: Introduction (mentioning, confidence: 99%)
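Because a stateless predictor's output depends only on the previous label id, it can be tabulated once over the whole vocabulary, exactly as this excerpt describes. The sketch below uses a toy embedding predictor with assumed sizes, not the cited system.

```python
import torch
import torch.nn as nn

vocab_size, dim = 500, 256

# Any predictor that depends only on the previous label id can be
# tabulated: run it once over the whole vocabulary before decoding.
predictor = nn.Embedding(vocab_size, dim).eval()

with torch.no_grad():
    table = predictor(torch.arange(vocab_size))  # (vocab_size, dim)

# At decode time the prediction network is a single table lookup:
prev_label = torch.tensor([42])
pred_out = table[prev_label]                     # (1, dim)
```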