Efficient Training of Audio Transformers with Patchout

Koutini, Khaled; Schlüter, Jan; Eghbal-zadeh, Hamid; Widmer, Gerhard

doi:10.21437/interspeech.2022-227

Cited by 75 publications (55 citation statements)

References 11 publications

“…Audio Spectrogram Transformers: Inspired by the Vision Transformer (ViT) [12], transformers capable of processing images have been adapted to the audio domain. Vision and Audio Spectrogram transformers [16,17,18,19] extract overlapping patches with a certain stride and size of the input image, add a positional encoding, and apply transformer layers to the flattened sequence of patches. Transformer layers use a global attention mechanism that leads to computation and memory complexity scaling quadratically with respect to the input sequence.…”

Section: Related Work (mentioning)

confidence: 99%
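
As a rough illustration of the quadratic scaling described in the statement above, the sketch below counts the overlapping patches extracted from a spectrogram and the resulting number of pairwise attention entries. The patch size (16), stride (10), and 128 Mel bins are assumptions chosen for illustration, not values given in the quoted text, and the function names are hypothetical.

```python
def num_patches(n_mels, n_frames, patch=16, stride=10):
    """Count overlapping square patches along frequency and time (no padding).

    patch and stride are illustrative assumptions, not values from the source.
    """
    n_freq = (n_mels - patch) // stride + 1
    n_time = (n_frames - patch) // stride + 1
    return n_freq * n_time


def attention_entries(seq_len):
    """Global self-attention compares every patch with every other patch."""
    return seq_len * seq_len


# Doubling the number of time frames roughly doubles the sequence length,
# which roughly quadruples the size of the attention matrix.
short_seq = num_patches(n_mels=128, n_frames=500)
long_seq = num_patches(n_mels=128, n_frames=1000)
print(short_seq, attention_entries(short_seq))
print(long_seq, attention_entries(long_seq))
```

Shortening the patch sequence therefore pays off more than linearly in attention cost, which is why sequence length matters so much for training and inference time.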

“…We apply the same preprocessing as in [17]. We use mono audio sampled at 32 kHz and compute Mel features from 25 ms windows with a hop size of 10 ms.…”

Section: Training Setup (mentioning)

confidence: 99%
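
The preprocessing figures quoted above translate into sample counts as sketched below. The 10-second clip length and the framing without padding or centering are assumptions made for this example; the quoted text only gives the sample rate, window length, and hop size. The helper names are hypothetical.

```python
SAMPLE_RATE = 32_000  # Hz, from the quoted statement
WINDOW_MS = 25        # analysis window length, from the quoted statement
HOP_MS = 10           # hop size, from the quoted statement


def ms_to_samples(ms, sample_rate=SAMPLE_RATE):
    """Convert a duration in milliseconds to a whole number of samples."""
    return sample_rate * ms // 1000


def num_frames(n_samples, win, hop):
    """Number of full windows that fit in the signal (assumes no padding)."""
    if n_samples < win:
        return 0
    return (n_samples - win) // hop + 1


win = ms_to_samples(WINDOW_MS)  # samples per window
hop = ms_to_samples(HOP_MS)     # samples between window starts
clip = 10 * SAMPLE_RATE         # assumed 10-second clip
print(win, hop, num_frames(clip, win, hop))
```

Libraries that center or pad the signal before framing will report a slightly different frame count for the same clip.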

“…For KD, we ensemble PaSST [17] models with different patch sizes and strides as a teacher. We pre-compute the predictions of 9 different PaSST models on AudioSet to speed up training and form an ensemble by averaging the logits.…”

Section: Knowledge Distillation (mentioning)

confidence: 99%
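
Forming a teacher by averaging pre-computed logits, as the statement above describes, can be sketched as follows. This is a minimal illustration with toy numbers, not the cited authors' code. Turning the averaged logits into per-class probabilities with a sigmoid is an assumption based on audio tagging being a multi-label task, and the function names are hypothetical.

```python
import math


def average_logits(per_model_logits):
    """Average class logits across models for one audio clip.

    per_model_logits: list of equal-length lists, one list per teacher model.
    """
    n_models = len(per_model_logits)
    n_classes = len(per_model_logits[0])
    return [
        sum(model[c] for model in per_model_logits) / n_models
        for c in range(n_classes)
    ]


def sigmoid(x):
    """Map a logit to an independent per-class probability (assumed multi-label)."""
    return 1.0 / (1.0 + math.exp(-x))


# Toy logits from two hypothetical teacher models over two classes.
teacher_logits = average_logits([[1.0, -2.0], [3.0, 0.0]])
soft_targets = [sigmoid(z) for z in teacher_logits]
print(teacher_logits, soft_targets)
```

Because the teacher predictions are computed once and stored, the student training loop only needs to load them, so the teacher models never run during student training.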

“…1. Crosses denote models based on Transformer architecture (Audio-MAE [19], HTS-AT [18], PaSST-S [17], PaSST-S-L [17], AST [16], KD-AST [10]) and circles denote models based on CNNs (PSLA [2], ERANN-1-6 [3], Wavegram-Logmel-CNN [1], CNN14 [1], KD-CNN [10], MobileNets [7] - ours).…”

Section: Introduction (mentioning)

confidence: 99%

“…the art in AT [16,17,18,19]. However, Transformers are complex in terms of parameters compared to CNNs, and the global self-attention mechanism scales quadratically with respect to the sequence length, making training and inference slow, and the deployment on edge devices infeasible.…”

Section: Introduction (mentioning)

confidence: 99%

Efficient Large-scale Audio Tagging via Transformer-to-CNN Knowledge Distillation

Schmid, Koutini, Widmer

2022

Preprint

Audio Spectrogram Transformer models rule the field of Audio Tagging, outrunning previously dominating Convolutional Neural Networks (CNNs). Their superiority is based on the ability to scale up and exploit large-scale datasets such as AudioSet. However, Transformers are demanding in terms of model size and computational requirements compared to CNNs. We propose a training procedure for efficient CNNs based on offline Knowledge Distillation (KD) from high-performing yet complex transformers. The proposed training schema and the efficient CNN design based on MobileNetV3 result in models outperforming previous solutions in terms of parameter and computational efficiency and prediction performance. We provide models of different complexity levels, scaling from low-complexity models up to a new state-of-the-art performance of .483 mAP on AudioSet.
