ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp39728.2021.9415009

Unsupervised Contrastive Learning of Sound Event Representations

Abstract: Self-supervised representation learning can mitigate the limitations in recognition tasks with few manually labeled data but abundant unlabeled data, a common scenario in sound event research. In this work, we explore unsupervised contrastive learning as a way to learn sound event representations. To this end, we propose to use the pretext task of contrasting differently augmented views of sound events. The views are computed primarily via mixing of training examples with unrelated backgrounds, followed by othe…
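The pretext task outlined in the abstract (two views of the same clip, each mixed with a different, unrelated background) can be illustrated in a few lines. The sketch below is a minimal NumPy illustration under assumed names (mix_with_background, make_views) and an assumed SNR-based mixing rule; it is not the paper's implementation, whose augmentation chain includes further transformations.

```python
import numpy as np

def mix_with_background(clip: np.ndarray, background: np.ndarray,
                        snr_db: float = 5.0) -> np.ndarray:
    """Mix a clip with an unrelated background at a given clip-to-background SNR.

    The SNR-based gain rule is an illustrative assumption, not the paper's scheme.
    """
    clip_power = np.mean(clip ** 2) + 1e-12
    bg_power = np.mean(background ** 2) + 1e-12
    gain = np.sqrt(clip_power / (bg_power * 10.0 ** (snr_db / 10.0)))
    return clip + gain * background

def make_views(clip, backgrounds, rng=None):
    """Return two differently augmented views of the same clip (a positive pair)."""
    rng = rng or np.random.default_rng()
    # Each view mixes the clip with a different, unrelated background.
    i, j = rng.choice(len(backgrounds), size=2, replace=False)
    return (mix_with_background(clip, backgrounds[i]),
            mix_with_background(clip, backgrounds[j]))
```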

Cited by 34 publications (59 citation statements) | References 20 publications

Citation statements (ordered by relevance):
“…We choose mixup because the concept of mixing sounds is an audio-informed operation, and it has been proven useful for SET [2,3,29] and other sound event research tasks [30]. In our view, mixup can be interpreted from two different perspectives.…”
Section: Mixup
confidence: 99%
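For reference, the mixup operation discussed in this statement, in its standard form, blends two examples and their labels with a Beta-sampled weight. A minimal sketch, assuming one-hot label vectors; alpha=0.2 is an illustrative default, not a value taken from the cited papers:

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.2, rng=None):
    """Blend two examples and their one-hot labels with a Beta(alpha, alpha) weight."""
    rng = rng or np.random.default_rng()
    lam = rng.beta(alpha, alpha)           # mixing weight in (0, 1)
    x = lam * x1 + (1.0 - lam) * x2        # mixed input (waveform or spectrogram)
    y = lam * y1 + (1.0 - lam) * y2        # mixed soft label
    return x, y
```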
“…This paradigm has seen major progress in computer vision [9,10,11] and in speech recognition [12,13,7]. For general-purpose audio, including a variety of environmental sounds beyond speech, the majority of works are based on contrastive learning [14,15,16,17,18,19], where a representation is learned by comparing pairs of examples selected by some semantically-correlated notion of similarity [20]. Specifically, comparisons are made between positive pairs of "similar" and negative pairs of "dissimilar" examples, with the goal of learning a representation that pulls together positive pairs and thus reflects semantic structure.…”
Section: Introduction
confidence: 99%
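The pull-together/push-apart objective described in this statement is commonly instantiated as the InfoNCE (NT-Xent) loss. A minimal PyTorch sketch, assuming z1[i] and z2[i] are embeddings of the two views of example i and that only cross-view pairs serve as negatives:

```python
import torch
import torch.nn.functional as F

def info_nce(z1: torch.Tensor, z2: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    """NT-Xent: (z1[i], z2[i]) are positives; other cross-view pairs are negatives."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / tau                          # cosine similarity / temperature
    targets = torch.arange(z1.size(0), device=z1.device)
    # Symmetrize over both view orders; the diagonal holds the positive pairs.
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))
```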
“…Recently, promising results have been attained by contrastive learning approaches that solve the proxy task of similarity maximization [17,18,19], following the seminal SimCLR work in visual representation learning [9]. This method consists of maximizing the similarity between differently-augmented views of the same input audio example.…”
Section: Introduction
confidence: 99%
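Combining the two ideas, a SimCLR-style update augments each clip twice, encodes both views, and minimizes the contrastive loss between them. The sketch below reuses the info_nce function from the previous snippet; augment, encoder, and optimizer are hypothetical placeholders for an audio augmentation pipeline, an embedding network, and any torch optimizer:

```python
def train_step(batch, augment, encoder, optimizer, tau=0.1):
    """One SimCLR-style step: maximize similarity between two views of each clip."""
    v1, v2 = augment(batch), augment(batch)   # two stochastic views per example
    loss = info_nce(encoder(v1), encoder(v2), tau)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```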