ESResNe(X)t-fbsp: Learning Robust Time-Frequency Transformation of Audio

Guzhov, Andrey; Raue, Federico; Hees, J.J. van; Dengel, Andreas

doi:10.1109/ijcnn52387.2021.9533654

Cited by 26 publications

(26 citation statements)

References 20 publications

“…The environmental sound classification task implies an assignment of correct labels given samples belonging to sound classes that surround us in the everyday life (e.g., "alarm clock", "car horn", "jackhammer", "mouse clicking", "cat"). To successfully solve this task, different approaches were proposed that included the use of one- [27,28] or two-dimensional Convolutional Neural Networks (CNN) operating on static [18,24,32,9,15,17,33,8,30] or trainable [23,10] time-frequency transformation of raw audio. While the first approaches relied on the task-specific design of models, the latter results confirmed that the use of domain adaptation from visual domain is beneficial [9,17,10].…”

Section: Related Work (mentioning)

confidence: 99%

“…To successfully solve this task, different approaches were proposed that included the use of one- [27,28] or two-dimensional Convolutional Neural Networks (CNN) operating on static [18,24,32,9,15,17,33,8,30] or trainable [23,10] time-frequency transformation of raw audio. While the first approaches relied on the task-specific design of models, the latter results confirmed that the use of domain adaptation from visual domain is beneficial [9,17,10]. However, the visual modality was used in a sequential way, implying the processing of only one modality simultaneously.…”

Section: Related Work (mentioning)

confidence: 99%

“…In this section, we describe the key components that make up the proposed model and the way how it handles its input. On a high level, our hybrid architecture combines a ResNet-based CLIP model [21] for visual and textual modalities and an ESResNeXt model [10] for audible modality, as can be seen in Figure 1.…”

Section: Model (mentioning)

confidence: 99%

“…The latest advances of the sound classification community provided many powerful audio-domain models that demonstrated impressive results. Combination of widely known datasets - such as AudioSet [7], UrbanSound8K [25] and ESC-50 [19] - and domain-specific and inter-domain techniques conditioned the rapid development of audio-dedicated methods and approaches [15,10,30].…”

Section: Introduction (mentioning)

confidence: 99%

“…In our work, we propose an approach to combine a high-performance audio model - ESResNeXt [10] - into a contrastive text-image model, namely CLIP [21], thus, obtaining a tri-modal hybrid architecture. The base CLIP model demonstrates impressive performance and strong domain adaptation capabilities that are referred as "zero-shot inference" in the original paper [21].…”

Section: Introduction (mentioning)

confidence: 99%

AudioCLIP: Extending CLIP to Image, Text and Audio

Guzhov, Raue, Hees et al. 2021

Preprint

Self Cite

In the past, the rapidly evolving field of sound classification greatly benefited from the application of methods from other domains. Today, we observe the trend to fuse domain-specific tasks and approaches together, which provides the community with new outstanding models. In this work, we present an extension of the CLIP model that handles audio in addition to text and images. Our proposed model incorporates the ESResNeXt audio model into the CLIP framework using the AudioSet dataset. Such a combination enables the proposed model to perform bimodal and unimodal classification and querying, while keeping CLIP's ability to generalize to unseen datasets in a zero-shot inference fashion. AudioCLIP achieves new state-of-the-art results in the Environmental Sound Classification (ESC) task, outperforming other approaches by reaching accuracies of 90.07 % on the UrbanSound8K and 97.15 % on the ESC-50 datasets. Further, it sets new baselines in the zero-shot ESC task on the same datasets (68.78 % and 69.40 %, respectively). Finally, we also assess the cross-modal querying performance of the proposed model as well as the influence of full and partial training on the results. For the sake of reproducibility, our code is published.
