2022
DOI: 10.1609/aaai.v36i10.21315

SSAST: Self-Supervised Audio Spectrogram Transformer

Abstract: Recently, neural networks based purely on self-attention, such as the Vision Transformer (ViT), have been shown to outperform deep learning models constructed with convolutional neural networks (CNNs) on various vision tasks, thus extending the success of Transformers, which were originally developed for language processing, to the vision domain. A recent study showed that a similar methodology can also be applied to the audio domain. Specifically, the Audio Spectrogram Transformer (AST) achieves state-of-the-…
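The abstract's core point, that ViT-style self-attention transfers to audio once a spectrogram is cut into patch tokens, is easy to see in code. Below is a minimal sketch of that patch-embedding step, assuming PyTorch; the shapes, names, and hyperparameters are illustrative and not taken from the AST reference implementation.

```python
# Minimal sketch of the ViT-style patch embedding that AST applies to audio
# spectrograms. Shapes and hyperparameters are illustrative assumptions,
# not the reference AST implementation.
import torch
import torch.nn as nn

class SpectrogramPatchEmbed(nn.Module):
    def __init__(self, patch_size=16, embed_dim=768):
        super().__init__()
        # A strided conv is the standard trick for cutting a 2-D input into
        # non-overlapping patches and projecting each patch to embed_dim.
        self.proj = nn.Conv2d(1, embed_dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, spec):
        # spec: (batch, 1, n_mels, n_frames), e.g. a log-Mel spectrogram
        x = self.proj(spec)                  # (batch, embed_dim, H', W')
        return x.flatten(2).transpose(1, 2)  # (batch, num_patches, embed_dim)

# The resulting patch sequence feeds a standard Transformer encoder:
patches = SpectrogramPatchEmbed()(torch.randn(2, 1, 128, 1024))  # (2, 512, 768)
layer = nn.TransformerEncoderLayer(768, nhead=12, batch_first=True)
out = nn.TransformerEncoder(layer, num_layers=2)(patches)       # (2, 512, 768)
```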

Cited by 108 publications (43 citation statements). References 33 publications (35 reference statements).
“…In-domain pre-training for audio. Existing in-domain (audio-only) self-supervised methods can be broadly categorized by the input signal (e.g., raw waveform [32,33,34], frame-level features [35,36,37] or spectrogram patches [18,38]) and the objective used for self-supervision (e.g., contrastive [39,33,40,41,35] or prediction/reconstruction [18,34,37,36]). For example, wav2vec 2.0 [33] takes raw waveform as inputs and exploits contrastive learning to discriminate contextualized representations in different time segments.…”
Section: Related Work (mentioning; confidence: 99%)
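The contrastive objective this statement attributes to wav2vec 2.0, discriminating each time step's contextualized representation from those of other segments, can be sketched as a simplified InfoNCE loss. The function below is an assumption-laden illustration (cosine similarity, all other in-utterance time steps as negatives), not the paper's exact negative-sampling scheme.

```python
# Hedged sketch of a wav2vec 2.0-style contrastive objective: for each time
# step, the contextualized vector should score higher against its own target
# than against targets from other time steps. Simplified InfoNCE, not the
# paper's exact sampling scheme.
import torch
import torch.nn.functional as F

def contrastive_loss(context, targets, temperature=0.1):
    # context, targets: (batch, time, dim) -- contextualized outputs and the
    # (e.g. quantized) target representations for the same positions.
    c = F.normalize(context, dim=-1)
    t = F.normalize(targets, dim=-1)
    logits = torch.einsum("bid,bjd->bij", c, t) / temperature  # (B, T, T)
    labels = torch.arange(logits.size(1), device=logits.device)
    # The diagonal (same time step) is the positive pair; every other time
    # step in the same utterance acts as a negative.
    return F.cross_entropy(logits.flatten(0, 1), labels.repeat(logits.size(0)))

loss = contrastive_loss(torch.randn(4, 50, 256), torch.randn(4, 50, 256))
```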
“…Mockingjay [42] proposed a masked acoustic model pretext task to reconstruct frame-level Mel-features of masked time frames. SS-AST [18] is a self-supervised learning method that operates over spectrogram patches and employs joint contrastive and reconstructive objectives on masked patches. Previous methods generate audio representations by encoding a full view of both masked and non-masked time or spectrogram segments for self-supervised pre-training.…”
Section: Related Work (mentioning; confidence: 99%)
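To make the "joint contrastive and reconstructive objectives on masked patches" concrete, here is a hedged sketch of such a combined loss. The function names, masking interface, and loss weighting are assumptions for illustration, not the released SSAST code.

```python
# Sketch of the joint objective the statement attributes to SS-AST: on masked
# spectrogram patches, combine a reconstruction (MSE) loss with a contrastive
# loss matching each masked position's prediction to its original patch.
# Names and weighting are assumptions, not the released implementation.
import torch
import torch.nn.functional as F

def joint_masked_patch_loss(pred_patches, true_patches, pred_embed, true_embed,
                            mask, temperature=0.1, recon_weight=10.0):
    # pred_patches/true_patches: (B, N, patch_dim)
    # pred_embed/true_embed:     (B, N, D)
    # mask: (B, N) boolean, True where the patch was masked out of the input.
    recon = F.mse_loss(pred_patches[mask], true_patches[mask])

    p = F.normalize(pred_embed[mask], dim=-1)  # (M, D) masked positions only
    t = F.normalize(true_embed[mask], dim=-1)
    logits = p @ t.T / temperature             # diagonal entries are positives
    labels = torch.arange(logits.size(0), device=logits.device)
    contrast = F.cross_entropy(logits, labels)

    return recon_weight * recon + contrast
```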