2020
DOI: 10.1109/lsp.2020.2985586

Pre-Training Audio Representations With Self-Supervision

Abstract: We explore self-supervision as a way to learn general-purpose audio representations. Specifically, we propose two self-supervised tasks: Audio2Vec, which aims at reconstructing a spectrogram slice from past and future slices, and Temporal-Gap, which estimates the distance between two short audio segments extracted at random from the same audio clip. We evaluate how the representations learned via self-supervision transfer to different downstream tasks, either training a task-specific linear classifier on top of…
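The two pretext tasks named in the abstract can be made concrete with a small sketch. The following is a minimal illustration, not the paper's implementation: the tensor shapes, the MLP encoder/decoder, and the gap-regression head are assumptions chosen only to show how each task's training target is formed.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Audio2VecCBoW(nn.Module):
    """Reconstruct a held-out middle spectrogram slice from its past and
    future neighbours (CBoW-style variant). Shapes are assumptions."""
    def __init__(self, n_mels=64, slice_frames=24, context_slices=4, dim=128):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Flatten(),
            nn.Linear(n_mels * slice_frames * context_slices, dim),
            nn.ReLU(),
        )
        self.decoder = nn.Linear(dim, n_mels * slice_frames)

    def forward(self, context, target):
        # context: (batch, context_slices, n_mels, slice_frames)
        # target:  (batch, n_mels, slice_frames), the held-out middle slice
        recon = self.decoder(self.encoder(context))
        return F.mse_loss(recon, target.flatten(1))

class TemporalGap(nn.Module):
    """Regress the time gap between two segments drawn at random from
    the same clip; the gap is assumed normalised to [0, 1]."""
    def __init__(self, n_mels=64, slice_frames=24, dim=128):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Flatten(),
            nn.Linear(n_mels * slice_frames, dim),
            nn.ReLU(),
        )
        self.head = nn.Linear(2 * dim, 1)

    def forward(self, seg_a, seg_b, gap):
        # seg_a, seg_b: (batch, n_mels, slice_frames); gap: (batch, 1)
        z = torch.cat([self.encoder(seg_a), self.encoder(seg_b)], dim=1)
        return F.mse_loss(self.head(z), gap)
```

In both cases it is the encoder, not the auxiliary decoder or regression head, that is kept and transferred to downstream tasks.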

Cited by 41 publications (53 citation statements)
References 26 publications (33 reference statements)
“…We then compare COLA to prior self-supervised methods proposed in [16,25], including a standard triplet loss, AUDIO2VEC (CBoW and SG) and temporal gap prediction models. Here, the CBoW and SG are generative models inspired from WORD2VEC, trained to reconstruct a randomly selected temporal slice of log-mel spectrograms given the rest or vice versa.…”
Section: Results (mentioning)
confidence: 99%
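The "vice versa" direction mentioned in this excerpt (the skip-gram variant, predicting the surrounding slices from the middle one) simply swaps input and target. A minimal self-contained sketch, with module sizes and shapes chosen as illustrative assumptions:

```python
import torch.nn as nn
import torch.nn.functional as F

class Audio2VecSG(nn.Module):
    """Skip-gram-style variant: encode the middle slice and reconstruct
    the surrounding context slices. Shapes are illustrative assumptions."""
    def __init__(self, n_mels=64, slice_frames=24, context_slices=4, dim=128):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Flatten(),
            nn.Linear(n_mels * slice_frames, dim),
            nn.ReLU(),
        )
        self.decoder = nn.Linear(dim, n_mels * slice_frames * context_slices)

    def forward(self, middle, context):
        # middle:  (batch, n_mels, slice_frames)
        # context: (batch, context_slices, n_mels, slice_frames)
        recon = self.decoder(self.encoder(middle))
        return F.mse_loss(recon, context.flatten(1))
```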
(Table 2 labels spilled into this excerpt: CBoW [16,25], SG [16,25], TemporalGap [16,25], Triplet Loss [16,25], TRILL [13].)
“…Table 2 shows that COLA embeddings consistently outperform all these methods. In particular, on acoustic scene classification, we obtain a competitive accuracy of 94% compared to 73% achieved with a triplet loss in [16].…”
Section: Results (mentioning)
confidence: 99%
“…These representations can then be used for downstream tasks, for example where only few data, or poorly labeled data, are available. Self-supervised learning has shown great promise in computer vision [9,10], and interesting results in the audio domain [7,11]. One of the first works in self-supervised sound event representation learning is [7], adopting a triplet loss-based training by creating anchor-positive pairs via simple audio transformations, e.g., adding noise or mixing examples.…”
Section: Introduction (mentioning)
confidence: 99%
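The anchor-positive construction described in this last excerpt can be sketched as follows. The `encoder` callable, the noise level, and the use of a rolled batch as negatives are assumptions for illustration, not the cited paper's exact setup.

```python
import torch
import torch.nn.functional as F

def triplet_pretrain_step(encoder, batch, margin=1.0, noise_std=0.1):
    """One hypothetical training step: positives are made by a simple
    transformation (additive noise); negatives come from other clips
    in the batch (here, the batch rolled by one position)."""
    # batch: (B, n_mels, frames) log-mel examples
    anchor = encoder(batch)
    positive = encoder(batch + noise_std * torch.randn_like(batch))
    negative = encoder(batch.roll(shifts=1, dims=0))
    return F.triplet_margin_loss(anchor, positive, negative, margin=margin)
```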