2022
DOI: 10.48550/arxiv.2207.00555
Preprint

FitHuBERT: Going Thinner and Deeper for Knowledge Distillation of Speech Self-Supervised Learning

Cited by 2 publications (3 citation statements)
References 0 publications
“…Relation to prior work. There are several previous studies that investigate SSL speech model compression [28,20,29,30] through sparsity, knowledge distillation, attention re-use, or their combinations. Our proposed study differs from them in several aspects.…”
Section: Related Work
confidence: 99%
“…Inspired by works on distilling speech models with smaller sampling rates [8,10], and given the quadratic memory bottleneck of transformer architectures as a function of input length, we assess the capacity of the SSL model, trained on 16-kHz audio inputs, to adapt to lower sampling rates. Given a speech file $x$ consisting of $T$ speech samples, $x = (x_i)_{i \in [1,T]}$, and a downsampling factor $k$, a function $f$, learned or unlearned depending on the chosen method, downsamples $x$ to $x' = f(x) = (x'_i)_{i \in [1, T/k]}$, a sequence of size $T/k$.…”
Section: Sequence Downsampling
confidence: 99%
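The statement above formalizes downsampling as a function $f$ that maps a $T$-sample waveform to a sequence of roughly $T/k$ elements. Below is a minimal sketch of the two flavors it alludes to: an unlearned (average-pooling) and a learned (strided-convolution) downsampler. The PyTorch framing, function names, and the specific pooling/convolution choices are illustrative assumptions, not the exact method of FitHuBERT or the citing paper.

```python
# Sketch of sequence downsampling by a factor k (assumed implementation, not the cited paper's).
import torch
import torch.nn as nn

def downsample_avg(x: torch.Tensor, k: int) -> torch.Tensor:
    """Unlearned f: average-pool a (batch, T) waveform by factor k -> (batch, ~T/k)."""
    return nn.functional.avg_pool1d(x.unsqueeze(1), kernel_size=k, stride=k).squeeze(1)

class LearnedDownsampler(nn.Module):
    """Learned f: a strided 1-D convolution reducing length T to ~T/k."""
    def __init__(self, k: int):
        super().__init__()
        self.conv = nn.Conv1d(in_channels=1, out_channels=1, kernel_size=k, stride=k)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.conv(x.unsqueeze(1)).squeeze(1)

# Example: one second of 16 kHz audio downsampled by k=2 -> ~8000 samples.
x = torch.randn(4, 16000)
print(downsample_avg(x, 2).shape)       # torch.Size([4, 8000])
print(LearnedDownsampler(2)(x).shape)   # torch.Size([4, 8000])
```

Either variant shortens the input sequence before the transformer, which is where the quadratic memory cost with respect to sequence length is incurred.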
“…As a matter of fact, several approaches have been proposed to shorten inference times using SSL models. Some attempted to distill state-of-the-art models by using shallower or thinner networks [7,8] or through downsampling the inputs [9,10]. However, while the downstream performance of distilled student models is comparable to larger teacher models on most speech classification tasks, a large gap is still witnessed for more complex tasks such as ASR [11].…”
Section: Introduction
confidence: 99%