ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp40776.2020.9053205

SpecAugment on Large Scale Datasets

Abstract: Recently, SpecAugment, an augmentation scheme for automatic speech recognition that acts directly on the spectrogram of input utterances, has shown to be highly effective in enhancing the performance of end-to-end networks on public datasets. In this paper, we demonstrate its effectiveness on tasks with large scale datasets by investigating its application to the Google Multidomain Dataset (Narayanan et al., 2018). We achieve improvement across all test domains by mixing raw training data augmented with SpecAu…


Cited by 115 publications (75 citation statements)
References 22 publications (36 reference statements)
“…Acoustic features are 64-dimensional log-mel filterbanks with a frame shift of 10 ms, which are stacked and downsampled by a factor of 3. For feature augmentation we employ the LibriFullAdapt SpecAugment policy from [21]. We use the Adam algorithm [22] to optimize all models, and the learning rate is scheduled with the warm-up, hold, and decay strategy proposed in [23].…”
Section: Methods
confidence: 99%
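The warm-up, hold, and decay learning-rate strategy mentioned above can be sketched as a piecewise schedule. The step counts and rates below are hypothetical placeholders, and the decay shape used in [23] may differ (for example, exponential rather than linear):

```python
def lr_schedule(step, peak_lr=1e-3, warmup=10_000, hold=40_000,
                decay_end=100_000, floor=1e-5):
    """Ramp-hold-decay schedule: linear warm-up to peak_lr,
    constant hold, then linear decay to a floor value."""
    if step < warmup:
        return peak_lr * step / warmup          # linear warm-up
    if step < hold:
        return peak_lr                          # hold at peak
    if step < decay_end:
        frac = (step - hold) / (decay_end - hold)
        return peak_lr + frac * (floor - peak_lr)  # linear decay
    return floor
```

In practice the three phase boundaries are tuned per model and dataset; the hold phase lets training settle at the peak rate before the decay begins.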
“…In order to reduce the effective frame rate, features from four adjacent frames are concatenated (producing 512-dimensional features) and further sub-sampled by a factor of 3, so that the effective input frame rate is 30 ms. In this work, we also apply SpecAugment masks [9] using the configuration described in [31], which we find improves performance over the system in [12]. The encoder network in all of our experiments is a stack of 8 unidirectional LSTM [29] layers, each with 2,048 units and a projection layer of 640 units.…”
Section: Methods
confidence: 99%
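The frame stacking and sub-sampling step described above (four adjacent frames concatenated into 512-dimensional features, then sub-sampled by a factor of 3) can be sketched as follows; the exact windowing and edge handling in the cited system are assumptions here:

```python
import numpy as np

def stack_and_subsample(frames, stack=4, subsample=3):
    """Concatenate `stack` adjacent frames into one wide feature vector,
    then keep every `subsample`-th stacked frame."""
    n, d = frames.shape
    usable = n - stack + 1  # last position where a full stack fits
    stacked = np.stack([frames[i:i + stack].reshape(-1)
                        for i in range(usable)])
    return stacked[::subsample]

feats = np.random.randn(100, 128)   # 128-dim features at a 10 ms shift
out = stack_and_subsample(feats)    # 512-dim features at a 30 ms shift
```

Stacking widens the feature dimension (4 × 128 = 512) while sub-sampling cuts the sequence length, so the encoder sees fewer, richer frames.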
“…We tuned the loss parameter β over three values (1e-2, 1e-3, and 1e-4), keeping all other training parameters unchanged, and found that the best performance is achieved at β = 1e-3. An adaptive SpecAugment [29,30] policy with two frequency masks (mask parameter F = 27) and ten time masks (maximum time-mask ratio p_S = 0.05) was used to augment the input, which was shared by the teacher and student models. The performance of the trained network is recorded in Table 1.…”
Section: Librispeech Experiments
confidence: 99%
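As a rough illustration of the masking policy quoted above (two frequency masks with F = 27, ten time masks each capped at a fraction p_S = 0.05 of the utterance length), here is a minimal, non-adaptive sketch. The adaptive policy of [29,30] additionally scales mask parameters with utterance length, which this toy version omits:

```python
import numpy as np

rng = np.random.default_rng(0)

def spec_augment(spec, n_freq_masks=2, F=27, n_time_masks=10, p_s=0.05):
    """Zero out random frequency bands and time spans of a
    (time, freq) log-mel spectrogram."""
    spec = spec.copy()
    T, n_mels = spec.shape
    for _ in range(n_freq_masks):
        f = rng.integers(0, F + 1)                 # mask width in mel bins
        f0 = rng.integers(0, max(1, n_mels - f + 1))
        spec[:, f0:f0 + f] = 0.0
    max_t = int(p_s * T)                           # per-mask cap in frames
    for _ in range(n_time_masks):
        t = rng.integers(0, max_t + 1)
        t0 = rng.integers(0, max(1, T - t + 1))
        spec[t0:t0 + t, :] = 0.0
    return spec

mel = np.ones((500, 80))      # toy 500-frame, 80-bin spectrogram
masked = spec_augment(mel)
```

Because the mask is applied to the shared input, the teacher and student in the distillation setup see identical corrupted spectrograms.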
“…Models were trained on a large multi-domain dataset similar to the one described in [12], where the domains include Search and FarField. The shared input of the teacher and student models is augmented using SpecAugment [29,30] and multi-style training [31]. The architecture of our uncompressed 0% sparse RNN-T model, which also serves as the teacher model for distillation, is similar to that described in [25], as follows.…”
Section: Multi-domain Experiments
confidence: 99%