2022
DOI: 10.1609/aaai.v36i10.21315

SSAST: Self-Supervised Audio Spectrogram Transformer

Abstract: Recently, neural networks based purely on self-attention, such as the Vision Transformer (ViT), have been shown to outperform deep learning models constructed with convolutional neural networks (CNNs) on various vision tasks, thus extending the success of Transformers, which were originally developed for language processing, to the vision domain. A recent study showed that a similar methodology can also be applied to the audio domain. Specifically, the Audio Spectrogram Transformer (AST) achieves state-of-the-…
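The abstract's core point, that ViT-style self-attention transfers to audio once a spectrogram is cut into patch tokens, is easy to see in code. Below is a minimal sketch of that patch-embedding step, assuming PyTorch; the shapes, names, and hyperparameters are illustrative and not taken from the AST reference implementation.

```python
# Minimal sketch of the ViT-style patch embedding that AST applies to audio
# spectrograms. Shapes and hyperparameters are illustrative assumptions,
# not the reference AST implementation.
import torch
import torch.nn as nn

class SpectrogramPatchEmbed(nn.Module):
    def __init__(self, patch_size=16, embed_dim=768):
        super().__init__()
        # A strided conv is the standard trick for cutting a 2-D input into
        # non-overlapping patches and projecting each patch to embed_dim.
        self.proj = nn.Conv2d(1, embed_dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, spec):
        # spec: (batch, 1, n_mels, n_frames), e.g. a log-Mel spectrogram
        x = self.proj(spec)                  # (batch, embed_dim, H', W')
        return x.flatten(2).transpose(1, 2)  # (batch, num_patches, embed_dim)

# The resulting patch sequence feeds a standard Transformer encoder:
patches = SpectrogramPatchEmbed()(torch.randn(2, 1, 128, 1024))  # (2, 512, 768)
layer = nn.TransformerEncoderLayer(768, nhead=12, batch_first=True)
out = nn.TransformerEncoder(layer, num_layers=2)(patches)       # (2, 512, 768)
```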

Cited by 108 publications (43 citation statements). References 33 publications (35 reference statements).
“…In-domain pre-training for audio. Existing in-domain (audio-only) self-supervised methods can be broadly categorized by the input signal (e.g., raw waveform [32,33,34], frame-level features [35,36,37] or spectrogram patches [18,38]) and the objective used for self-supervision (e.g., contrastive [39,33,40,41,35] or prediction/reconstruction [18,34,37,36]). For example, wav2vec 2.0 [33] takes raw waveform as inputs and exploits contrastive learning to discriminate contextualized representations in different time segments.…”
Section: Related Work (mentioning; confidence: 99%)
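The contrastive objective this statement attributes to wav2vec 2.0, discriminating each time step's contextualized representation from those of other segments, can be sketched as a simplified InfoNCE loss. The function below is an assumption-laden illustration (cosine similarity, all other in-utterance time steps as negatives), not the paper's exact negative-sampling scheme.

```python
# Hedged sketch of a wav2vec 2.0-style contrastive objective: for each time
# step, the contextualized vector should score higher against its own target
# than against targets from other time steps. Simplified InfoNCE, not the
# paper's exact sampling scheme.
import torch
import torch.nn.functional as F

def contrastive_loss(context, targets, temperature=0.1):
    # context, targets: (batch, time, dim) -- contextualized outputs and the
    # (e.g. quantized) target representations for the same positions.
    c = F.normalize(context, dim=-1)
    t = F.normalize(targets, dim=-1)
    logits = torch.einsum("bid,bjd->bij", c, t) / temperature  # (B, T, T)
    labels = torch.arange(logits.size(1), device=logits.device)
    # The diagonal (same time step) is the positive pair; every other time
    # step in the same utterance acts as a negative.
    return F.cross_entropy(logits.flatten(0, 1), labels.repeat(logits.size(0)))

loss = contrastive_loss(torch.randn(4, 50, 256), torch.randn(4, 50, 256))
```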
“…Mockingjay [42] proposed a masked acoustic model pretext task to reconstruct frame-level Mel-features of masked time frames. SS-AST [18] is a self-supervised learning method that operates over spectrogram patches and employs joint contrastive and reconstructive objectives on masked patches. Previous methods generate audio representations by encoding a full view of both masked and non-masked time or spectrogram segments for self-supervised pre-training.…”
Section: Related Work (mentioning; confidence: 99%)
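To make the "joint contrastive and reconstructive objectives on masked patches" concrete, here is a hedged sketch of such a combined loss. The function names, masking interface, and loss weighting are assumptions for illustration, not the released SSAST code.

```python
# Sketch of the joint objective the statement attributes to SS-AST: on masked
# spectrogram patches, combine a reconstruction (MSE) loss with a contrastive
# loss matching each masked position's prediction to its original patch.
# Names and weighting are assumptions, not the released implementation.
import torch
import torch.nn.functional as F

def joint_masked_patch_loss(pred_patches, true_patches, pred_embed, true_embed,
                            mask, temperature=0.1, recon_weight=10.0):
    # pred_patches/true_patches: (B, N, patch_dim)
    # pred_embed/true_embed:     (B, N, D)
    # mask: (B, N) boolean, True where the patch was masked out of the input.
    recon = F.mse_loss(pred_patches[mask], true_patches[mask])

    p = F.normalize(pred_embed[mask], dim=-1)  # (M, D) masked positions only
    t = F.normalize(true_embed[mask], dim=-1)
    logits = p @ t.T / temperature             # diagonal entries are positives
    labels = torch.arange(logits.size(0), device=logits.device)
    contrast = F.cross_entropy(logits, labels)

    return recon_weight * recon + contrast
```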