2022
DOI: 10.48550/arxiv.2207.00555
Preprint

FitHuBERT: Going Thinner and Deeper for Knowledge Distillation of Speech Self-Supervised Learning

Cited by 2 publications (3 citation statements)
References 0 publications
“…Relation to prior work. There are several previous studies that investigate SSL speech model compression [28,20,29,30] through sparsity, knowledge distillation, attention re-use, or their combinations. Our proposed study differs from them in several aspects.…”
Section: Related Work
confidence: 99%
“…Inspired by works on distilling speech models with smaller sampling rates [8,10], and given the quadratic memory bottleneck of transformer architectures as a function of input length, we assess the capacity of the SSL model, trained on 16-kHz audio inputs, to adapt to lower sampling rates. Given a speech file $x$ consisting of $T$ speech samples, $x = (x_i)_{i \in [1,T]}$, and a downsampling factor $k$, a function $f$, learned or unlearned depending on the chosen method, downsamples $x$ to $x' = f(x) = (x'_i)_{i \in [1, T/k]}$, a sequence of size $T/k$.…”
Section: Sequence Downsampling
confidence: 99%
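The statement above formalizes downsampling as a function $f$ that maps a $T$-sample waveform to a sequence of roughly $T/k$ elements. Below is a minimal sketch of the two flavors it alludes to: an unlearned (average-pooling) and a learned (strided-convolution) downsampler. The PyTorch framing, function names, and the specific pooling/convolution choices are illustrative assumptions, not the exact method of FitHuBERT or the citing paper.

```python
# Sketch of sequence downsampling by a factor k (assumed implementation, not the cited paper's).
import torch
import torch.nn as nn

def downsample_avg(x: torch.Tensor, k: int) -> torch.Tensor:
    """Unlearned f: average-pool a (batch, T) waveform by factor k -> (batch, ~T/k)."""
    return nn.functional.avg_pool1d(x.unsqueeze(1), kernel_size=k, stride=k).squeeze(1)

class LearnedDownsampler(nn.Module):
    """Learned f: a strided 1-D convolution reducing length T to ~T/k."""
    def __init__(self, k: int):
        super().__init__()
        self.conv = nn.Conv1d(in_channels=1, out_channels=1, kernel_size=k, stride=k)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.conv(x.unsqueeze(1)).squeeze(1)

# Example: one second of 16 kHz audio downsampled by k=2 -> ~8000 samples.
x = torch.randn(4, 16000)
print(downsample_avg(x, 2).shape)       # torch.Size([4, 8000])
print(LearnedDownsampler(2)(x).shape)   # torch.Size([4, 8000])
```

Either variant shortens the input sequence before the transformer, which is where the quadratic memory cost with respect to sequence length is incurred.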
“…As a matter of fact, several approaches have been proposed to shorten inference times using SSL models. Some attempted to distill state-of-the-art models by using shallower or thinner networks [7,8] or through downsampling the inputs [9,10]. However, while the downstream performance of distilled student models is comparable to larger teacher models on most speech classification tasks, a large gap is still witnessed for more complex tasks such as ASR [11].…”
Section: Introduction
confidence: 99%