Do We Need Sound for Sound Source Localization?

Oya, Takashi; Iwase, Shohei; Natsume, Ryota; Itazuri, Takahiro; Yamaguchi, Shugo; Morishima, Shigeo

doi:10.1007/978-3-030-69544-6_8

Cited by 8 publications

(5 citation statements)

References 39 publications

“…Other approaches include those that determine the temporal alignment of videos and sounds [5], [14], [15], and hybrid approaches that combine both tasks [16]. In addition to simple single-domain applications such as sound/image classification [10], [11], [16] and action recognition [5], [15], these works demonstrate the benefits of learned features in complex cross-domain applications such as sound localization [3], [12]-[15], crossmodal retrieval [4], and sound separation [5]. However, the target of these prior works is limited to learning semantic cross-modal relationships.…”

Section: Related Work A: Self-supervised Audio-visual Learning (mentioning)

confidence: 99%

Self-Supervised Learning for Audio-Visual Relationships of Videos With Stereo Sounds

2022

Learning cross-modal features is an essential task for many multimedia applications such as sound localization, audio-visual alignment, and image/audio retrieval. Most existing methods mainly focus on the semantic correspondence between videos and monaural sounds, and spatial information of sound sources has not been considered. However, sound locations are critical for understanding the sound environment. To this end, it is necessary to acquire cross-modal features that reflect the semantic and spatial relationship between videos and sounds. A video with stereo sound, which has become commonly used, provides the direction of arrival of each sound source in addition to the category information. This indicates its potential to acquire a desired cross-modal feature space. In this paper, we propose a novel self-supervised approach to learn a cross-modal feature representation that captures both the category and location of each sound source using stereo sound as input. For a set of unlabeled videos, the proposed method generates three kinds of audio-visual pairs: 1) perfectly matched pairs from the same video, 2) pairs from the same video but with the flipped stereo sound, and 3) pairs from a different video. The cross-modal feature encoder of the proposed method is trained on triplet loss to reflect the relationship between these three pairs (1 > 2 > 3). We apply this method to cross-modal image/audio retrieval. Compared with previous audio-visual pretext tasks, the proposed method shows significant improvement in both real and synthetic datasets.

INDEX TERMS: Computer vision, feature extraction, machine learning, self-supervised learning, audiovisual learning, cross-modal retrieval.
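The ordering of the three pair types in the abstract (1 > 2 > 3) can be illustrated with a small ranking loss. The following is a minimal sketch, not the authors' implementation: the function names, the use of cosine similarity, the sum of two hinge terms, and the margin value of 0.2 are all assumptions made for illustration. The abstract states only that a triplet loss is used.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two non-zero feature vectors (lists of floats)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def tiered_ranking_loss(visual, audio_matched, audio_flipped, audio_other, margin=0.2):
    """Hypothetical loss that encourages the ordering matched > flipped > other.

    visual        : visual feature of a video clip
    audio_matched : audio feature from the same video (pair type 1)
    audio_flipped : audio feature from the same video with stereo channels flipped (pair type 2)
    audio_other   : audio feature from a different video (pair type 3)
    margin        : assumed hinge margin; not taken from the source
    """
    s1 = cosine_similarity(visual, audio_matched)
    s2 = cosine_similarity(visual, audio_flipped)
    s3 = cosine_similarity(visual, audio_other)
    # Each hinge term is zero once the preferred pair beats the other by at least the margin.
    return max(0.0, margin - (s1 - s2)) + max(0.0, margin - (s2 - s3))


if __name__ == "__main__":
    visual = [1.0, 0.0]
    matched, flipped, other = [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]
    print(tiered_ranking_loss(visual, matched, flipped, other))   # ordering satisfied
    print(tiered_ranking_loss(visual, other, flipped, matched))   # ordering violated
```

In this toy example the loss is zero when the three similarities already respect the ordering with the assumed margin, and positive when the matched and mismatched audio features are swapped.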

“…There have been many sound source localization methods in the field of computer vision, e.g. mutual information and CCA [4], [5], CAM-based [6], [7], attention mechanism based [8], [9], [10], [11], those that utilize motion information [12], [13], [14]. Sound source localization and audio-visual sound source separation are closely related because it is necessary to identify the position of the sound source in an image in order to perform audiovisual sound source separation.…”

Section: Related Work (mentioning)

confidence: 99%

The Sound of Bounding-Boxes

Oya, Iwase, Morishima

2022

Preprint

Self Cite

In the task of audio-visual sound source separation, which leverages visual information for sound source separation, identifying objects in an image is a crucial step prior to separating the sound source. However, existing methods that assign sound on detected bounding boxes suffer from a problem that their approach heavily relies on pre-trained object detectors. Specifically, when using these existing methods, it is required to predetermine all the possible categories of objects that can produce sound and use an object detector applicable to all such categories. To tackle this problem, we propose a fully unsupervised method that learns to detect objects in an image and separate sound source simultaneously. As our method does not rely on any pre-trained detector, our method is applicable to arbitrary categories without any additional annotation. Furthermore, although being fully unsupervised, we found that our method performs comparably in separation accuracy.

“…However, this work still requires extra scene prior due to the lack of one-to-one annotations. Oya et al [38] proposed a step-wise training strategy that first gets potential sounding objects based on visual information and then identifies the proposal based on audio information. Nevertheless, the experimental scenario in the work is relatively simple (two objects, one of which makes a sound).…”

Section: Sounding Object Localization In Visual Scenes (mentioning)

confidence: 99%

Class-aware Sounding Objects Localization via Audiovisual Correspondence

Hu, Wei, Qu et al.

2021

Preprint

Audiovisual scenes are pervasive in our daily life. It is commonplace for humans to discriminatively localize different sounding objects but quite challenging for machines to achieve class-aware sounding objects localization without category annotations, i.e., localizing the sounding object and recognizing its category. To address this problem, we propose a two-stage step-by-step learning framework to localize and recognize sounding objects in complex audiovisual scenarios using only the correspondence between audio and vision. First, we propose to determine the sounding area via coarse-grained audiovisual correspondence in the single source cases. Then visual features in the sounding area are leveraged as candidate object representations to establish a category-representation object dictionary for expressive visual character extraction. We generate class-aware object localization maps in cocktail-party scenarios and use audiovisual correspondence to suppress silent areas by referring to this dictionary. Finally, we employ category-level audiovisual consistency as the supervision to achieve fine-grained audio and sounding object distribution alignment. Experiments on both realistic and synthesized videos show that our model is superior in localizing and recognizing objects as well as filtering out silent ones. We also transfer the learned audiovisual network into the unsupervised object detection task, obtaining reasonable performance.
