Interspeech 2022
DOI: 10.21437/interspeech.2022-227

Efficient Training of Audio Transformers with Patchout

Cited by 75 publications (55 citation statements)
References 11 publications
“…Audio Spectrogram Transformers: Inspired by the Vision Transformer (ViT) [12], transformers capable of processing images have been adapted to the audio domain. Vision and Audio Spectrogram transformers [16,17,18,19] extract overlapping patches of a certain stride and size from the input image, add a positional encoding, and apply transformer layers to the flattened sequence of patches. Transformer layers use a global attention mechanism whose computation and memory complexity scale quadratically with respect to the input sequence length.…”
Section: Related Work
confidence: 99%
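The patch-extraction step the statement describes can be sketched as follows. This is a minimal NumPy illustration, not the cited models' implementation: the patch size of 16 and stride of 10 are example values (stride < patch size gives the overlapping patches mentioned), and the random positional encoding stands in for the learned embeddings these models actually use.

```python
import numpy as np

def extract_patches(spec, patch=16, stride=10):
    """Slide a patch x patch window over a (freq, time) spectrogram
    with the given stride (overlapping when stride < patch) and
    flatten each window into one token vector."""
    n_freq, n_time = spec.shape
    tokens = []
    for f in range(0, n_freq - patch + 1, stride):
        for t in range(0, n_time - patch + 1, stride):
            tokens.append(spec[f:f + patch, t:t + patch].reshape(-1))
    return np.stack(tokens)  # shape: (num_patches, patch * patch)

# Toy 128-mel-bin x 100-frame "spectrogram".
spec = np.random.randn(128, 100)
tokens = extract_patches(spec)          # (108, 256) for these sizes

# A positional encoding is added per token before the transformer layers;
# a small random matrix stands in for learned embeddings here.
pos = 0.02 * np.random.randn(*tokens.shape)
tokens = tokens + pos
```

Because self-attention compares every token with every other token, the cost of the subsequent transformer layers grows quadratically in `num_patches` — the scaling issue the quoted passage points out, and the motivation for reducing the sequence length during training.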
“…We apply the same preprocessing as in [17]. We use mono audio sampled at 32 kHz and compute Mel features from 25 ms windows with a hop size of 10 ms.…”
Section: Training Setup
confidence: 99%
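The framing arithmetic implied by the quoted setup (25 ms windows with a 10 ms hop at 32 kHz) can be sketched as below. This is a plain-NumPy illustration of the windowed power spectrogram only; the Mel filterbank that [17] applies on top, and its exact parameters, are omitted.

```python
import numpy as np

SR = 32_000            # mono audio sampled at 32 kHz
WIN = int(0.025 * SR)  # 25 ms window -> 800 samples
HOP = int(0.010 * SR)  # 10 ms hop   -> 320 samples

def frame_signal(y, win=WIN, hop=HOP):
    """Split a 1-D signal into overlapping Hann-windowed frames."""
    n_frames = 1 + (len(y) - win) // hop
    window = np.hanning(win)
    return np.stack([y[i * hop : i * hop + win] * window
                     for i in range(n_frames)])

def power_spectrogram(y):
    """Per-frame power spectrum; a Mel filterbank would be applied next."""
    frames = frame_signal(y)
    return np.abs(np.fft.rfft(frames, axis=1)) ** 2  # (n_frames, WIN//2 + 1)

y = np.random.randn(SR)   # one second of audio
S = power_spectrogram(y)  # 98 frames x 401 frequency bins
```

One second of audio yields 98 frames here, so the 10 ms hop gives roughly 100 frames per second — the time resolution the spectrogram transformer's patches are extracted from.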