ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp39728.2021.9413905

Efficient Knowledge Distillation for RNN-Transducer Models

Abstract: Knowledge Distillation is an effective method of transferring knowledge from a large model to a smaller model. Distillation can be viewed as a type of model compression, and has played an important role for on-device ASR applications. In this paper, we develop a distillation method for RNN-Transducer (RNN-T) models, a popular end-to-end neural network architecture for streaming speech recognition. Our proposed distillation loss is simple and efficient, and uses only the "y" and "blank" posterior probabilities …
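
As a rough illustration only (not the authors' published implementation), the sketch below shows one way such a collapsed distillation loss could look in PyTorch: each RNN-T lattice node's output distribution is reduced to three probabilities, the "y" and "blank" posteriors plus the remaining mass, and the student is matched to the teacher with a KL divergence over those three terms. The function name, tensor shapes, and the choice to drop the final lattice row are assumptions of this sketch.

```python
import torch


def collapsed_rnnt_kd_loss(teacher_logits, student_logits, y, blank_id, eps=1e-8):
    """Sketch: collapse each lattice node's distribution to {correct label y_u,
    blank, remainder} and take KL(teacher || student) over those three terms.

    teacher_logits, student_logits: [T, U+1, V] RNN-T joint-network outputs.
    y: [U] ground-truth label ids (long tensor).
    """
    p_t = teacher_logits.softmax(dim=-1)
    p_s = student_logits.softmax(dim=-1)
    T, U1, _ = p_t.shape
    U = U1 - 1
    # Only rows 0..U-1 of the lattice have a "next label" y_u; the final row
    # is dropped here for simplicity (an assumption of this sketch).
    p_t, p_s = p_t[:, :U], p_s[:, :U]
    idx = y.view(1, U, 1).expand(T, U, 1)

    def collapse(p):
        p_y = p.gather(-1, idx).squeeze(-1)            # P(correct next label y_u)
        p_blank = p[..., blank_id]                     # P(blank)
        p_rem = (1.0 - p_y - p_blank).clamp_min(eps)   # mass of all remaining labels
        return torch.stack([p_y, p_blank, p_rem], dim=-1)

    q_t, q_s = collapse(p_t), collapse(p_s)
    # KL(teacher || student) on the 3-way collapsed distribution, summed over nodes.
    return (q_t * (q_t.clamp_min(eps).log() - q_s.clamp_min(eps).log())).sum()
```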

Cited by 22 publications (16 citation statements). References 26 publications (40 reference statements).

Citation statements, ordered by relevance:
“…where y is the ground-truth sequence of tokens, whose length is U, P^0 and P^∞ ∈ R^{U×V} are the streaming-mode and full-context-mode outputs, respectively, L_trans(·, ·) is the transducer loss, and L_distil(·, ·) is the in-place knowledge distillation loss. Instead of taking the direct KL divergence between P^0 and P^∞, Dual-mode ASR follows [17] to merge the probabilities of unimportant tokens for efficient knowledge distillation…”
Section: Dual-mode ASR
confidence: 99%
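
To make the quoted objective concrete, here is a hedged sketch of how a dual-mode training loss of this shape could be wired up. `transducer_loss_fn` and `distil_loss_fn` are assumed, user-supplied callables (for example an RNN-T loss and the collapsed KD loss sketched earlier); the exact set of terms and their weighting are assumptions of this sketch, not the recipe of [18].

```python
def dual_mode_loss(streaming_logits, full_context_logits, y,
                   transducer_loss_fn, distil_loss_fn, lam=1.0):
    """Hypothetical sketch of a dual-mode objective: transducer loss on both
    modes plus an in-place distillation term that pulls the streaming output
    toward the (detached) full-context output."""
    l_stream = transducer_loss_fn(streaming_logits, y)    # L_trans, streaming mode
    l_full = transducer_loss_fn(full_context_logits, y)   # L_trans, full-context mode
    # The full-context branch acts as the teacher; detach it so the
    # distillation gradient only updates the streaming (student) path.
    l_distil = distil_loss_fn(full_context_logits.detach(), streaming_logits)
    return l_stream + l_full + lam * l_distil
```
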
“…In practice, although requirements vary by application, latency targets usually sit at around 300 ms (median) and under 1 s (95th percentile). To address this challenge, several methods have been studied, in particular joint training [16] and knowledge distillation [17]. Recently, a framework called Dual-mode ASR was introduced, in which a single model is trained with two different modes: streaming and full-context [18].…”
Section: Introduction
confidence: 99%
“…Knowledge distillation (KD) techniques have been used in speech recognition for model compression [16,17,18], domain adaptation [19,20,21,22,23], and transferring knowledge from full-context to streaming scenarios [24,25]. These methods apply KD both at the sequence level [17,18] and at the frame level [16,23]. Early works on sequence-level KD [26,24] used a two-step procedure.…”
Section: Related Work
confidence: 99%
“…Early works on sequence-level KD [26,24] used a two-step procedure. However, a recently proposed method by Panchapagesan et al. [18] allows single-step co-distillation in RNN-T models. Yu et al. [25] used this loss function to train encoder modules capable of working in both streaming and full-context speech recognition scenarios.…”
Section: Related Work
confidence: 99%
“…Hard-target distillation was also used in follow-up works [11,13] that further improved the SoTA results on LibriSpeech by combining it with pre-training. More recently, soft-target distillation for RNN-T was explored in [15,16], where the KL divergence between the teacher and student output label distributions is used as the loss function, similar to those used in [4]. However, it was only applied to model compression [15] and streaming ASR models [16].…”
Section: Introduction
confidence: 99%
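
The soft-target distillation referred to here is, at its core, a KL divergence between the teacher's and the student's output label distributions. A minimal, generic sketch follows; the temperature parameter is a common Hinton-style convention added for illustration and is not taken from the cited papers.

```python
import torch.nn.functional as F


def soft_target_kd_loss(teacher_logits, student_logits, temperature=1.0):
    """Generic soft-target distillation sketch: KL divergence between the
    teacher's and the student's output label distributions."""
    t = temperature
    log_p_student = F.log_softmax(student_logits / t, dim=-1)
    p_teacher = F.softmax(teacher_logits / t, dim=-1)
    # KL(teacher || student); the t**2 factor keeps the gradient scale
    # comparable across temperatures.
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * (t * t)
```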