ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp43922.2022.9746312
HTS-AT: A Hierarchical Token-Semantic Audio Transformer for Sound Classification and Detection

Cited by 72 publications (34 citation statements)
References 19 publications
“…Audio Spectrogram Transformers: Inspired by the Vision Transformer (ViT) [12], transformers capable of processing images have been adapted to the audio domain. Vision and Audio Spectrogram transformers [16,17,18,19] extract overlapping patches with a certain stride and size from the input image, add a positional encoding, and apply transformer layers to the flattened sequence of patches. Transformer layers use a global attention mechanism whose computation and memory complexity scales quadratically with respect to the input sequence length.…”
Section: Related Work
confidence: 99%
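The pipeline described in the excerpt above can be sketched in a few lines. This is a minimal illustration, not the actual HTS-AT or AST implementation: the patch size, stride, and spectrogram shape are assumed values, and the attention step is plain scaled dot-product self-attention, which makes the quadratic cost visible as an N x N score matrix for N patch tokens.

```python
import numpy as np

def extract_patches(spec, size=16, stride=10):
    """Slide a size x size window over the spectrogram with the given
    stride and flatten each window into one token vector (overlapping
    patches, as in spectrogram transformers)."""
    n_mels, n_frames = spec.shape
    patches = [
        spec[i:i + size, j:j + size].ravel()
        for i in range(0, n_mels - size + 1, stride)
        for j in range(0, n_frames - size + 1, stride)
    ]
    return np.stack(patches)  # shape: (num_patches, size * size)

def attention_weights(tokens):
    """Global scaled dot-product attention scores over all tokens:
    an N x N matrix, hence quadratic memory in the token count."""
    scores = tokens @ tokens.T / np.sqrt(tokens.shape[1])
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=1, keepdims=True)

# Assumed input: a 64-mel-bin, 256-frame log-mel spectrogram.
spec = np.random.default_rng(0).standard_normal((64, 256))
tokens = extract_patches(spec)
attn = attention_weights(tokens)
print(tokens.shape[0], attn.shape)
```

Doubling the number of frames roughly doubles the token count N but quadruples the size of `attn`, which is the scaling problem that hierarchical designs such as HTS-AT are built to avoid.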
“…1. Crosses denote models based on the Transformer architecture (Audio-MAE [19], HTS-AT [18], PaSST-S [17], PaSST-S-L [17], AST [16], KD-AST [10]) and circles denote models based on CNNs (PSLA [2], ERANN-1-6 [3], Wavegram-logmel-CNN [1], CNN14 [1], KD-CNN [10], MobileNets [7] - ours).…”
Section: Introduction
confidence: 99%
“…In terms of weakly supervised pre-training, the works in [3,5,6,7] proposed novel model architectures for weakly supervised AT and explored their performance on a limited number of downstream tasks. Further, the works [8,9,10,11,12,13] explored self-supervised learning with a linear classifier for sound classification, focusing on sound-domain downstream tasks.…”
Section: Related Work
confidence: 99%
“…Devlin et al. [40] introduced a new language representation model called BERT, and used the pre-trained BERT model with fine-tuning to create state-of-the-art models for a wide range of tasks. Indeed, beyond revolutionising NLP, Transformers have outperformed deep learning models constructed with CNNs on various audio classification tasks [41], thus extending their success to the audio domain [42,43,44].…”
Section: Related Work
confidence: 99%