2021
DOI: 10.48550/arxiv.2104.06401
Preprint

Self-supervised object detection from audio-visual correspondence

Abstract: We tackle the problem of learning object detectors without supervision. Differently from weakly-supervised object detection, we do not assume image-level class labels. Instead, we extract a supervisory signal from audio-visual data, using the audio component to "teach" the object detector. While this problem is related to sound source localisation, it is considerably harder because the detector must classify the objects by type, enumerate each instance of the object, and do so even when the object is silent. W…

Cited by 7 publications (9 citation statements)
References 72 publications
“…For the comparison of the classifier CAM and ContraCAM, we use the publicly available supervised classifier 8 and MoCov2 9 trained on the ImageNet dataset under the ResNet-50 architecture. Here, we do not apply the expansion trick and run a single iteration for the ContraCAM.…”
Section: B1 Implementation Details For Object Localization Results (mentioning)
confidence: 99%
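For context, the "classifier CAM" referenced in the quote weights the last convolutional feature map of a ResNet-50 by the classifier's fully connected weights for a chosen class to obtain a localization map. Below is a minimal sketch of that standard CAM formulation; the torchvision model and layer names are real, but the helper function and tensor shapes are illustrative assumptions, not the cited paper's code.

```python
# Minimal CAM sketch on torchvision's ResNet-50 (illustrative, not the
# cited paper's implementation). Requires torchvision >= 0.13 for `weights=`.
import torch
import torchvision

model = torchvision.models.resnet50(weights="IMAGENET1K_V1").eval()

def classifier_cam(image: torch.Tensor, class_idx: int) -> torch.Tensor:
    """image: (1, 3, H, W) normalized tensor -> (h, w) activation map."""
    feats = {}
    # Capture the last convolutional feature map (before global average pooling).
    handle = model.layer4.register_forward_hook(
        lambda _m, _i, out: feats.__setitem__("conv", out)
    )
    with torch.no_grad():
        model(image)
    handle.remove()
    fmap = feats["conv"][0]                # (2048, h, w)
    weights = model.fc.weight[class_idx]   # (2048,) classifier weights for the class
    cam = torch.einsum("c,chw->hw", weights, fmap)
    cam = torch.relu(cam)
    return cam / (cam.max() + 1e-8)        # normalize to [0, 1]
```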
“…Self-supervised learning of visual representations from unlabeled images is a fundamental task of machine learning, which establishes various applications including object recognition [1,2], reinforcement learning [3,4], out-of-distribution detection [5,6], and multimodal learning [7,8]. Recently, contrastive learning [1,2,9-15] has shown remarkable advances along this line.…”
Section: Introduction (mentioning)
confidence: 99%
“…The advancement of Deep Learning enabled a multitude of self-supervised approaches for localizing sounds in recent years. The line of work most relevant for the problem we are considering in this manuscript aims at locating sound sources in unlabeled videos [2], [3], [6], [9], [12], [13]. Arandjelović and Zisserman [3] propose a framework for cross-modal self-supervision from video, enabling the localization of sound-emitting objects by correlating the features produced by an audio-specific and image-specific encoding network.…”
Section: Related Work (mentioning)
confidence: 99%
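To make the correlation step concrete, a localization map can be formed as the per-location cosine similarity between spatial image features and a single audio embedding. The sketch below assumes generic encoders; the function name and tensor shapes are illustrative, not taken from the cited works.

```python
# Per-location audio-visual similarity map (illustrative sketch).
import torch
import torch.nn.functional as F

def localization_map(img_feats: torch.Tensor, audio_emb: torch.Tensor) -> torch.Tensor:
    """img_feats: (B, C, h, w) from an image encoder,
    audio_emb: (B, C) from an audio encoder -> (B, h, w) similarity map."""
    img = F.normalize(img_feats, dim=1)   # unit-norm feature at each location
    aud = F.normalize(audio_emb, dim=1)   # unit-norm audio embedding
    # Dot product of the audio vector with every spatial position.
    return torch.einsum("bchw,bc->bhw", img, aud)
```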
“…Self-supervised approaches for audio-visual object detection, in contrast, are able to localize sound-emitting objects in videos without explicit manual annotations for object positions. The majority of these approaches are based on audio-visual co-occurrence of shared features in both domains [2], [6]. The main idea in these works is to contrast two random video frames and their corresponding audio segment from a large-scale dataset with each other, leveraging the fact that the chance of randomly sampled frames from different videos containing the same object type is negligible.…”
Section: Introduction (mentioning)
confidence: 99%
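The contrasting scheme described here is typically implemented as an InfoNCE-style objective in which frame/audio pairs from the same video are positives and in-batch pairs from other videos serve as negatives. The sketch below is one common formulation under that assumption; the temperature value and the symmetric loss are standard choices, not necessarily the exact recipe of the works cited.

```python
# InfoNCE-style audio-visual contrastive loss (illustrative sketch).
import torch
import torch.nn.functional as F

def audio_visual_nce(img_emb: torch.Tensor, aud_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """img_emb, aud_emb: (B, D) embeddings of matching frame/audio pairs."""
    img = F.normalize(img_emb, dim=1)
    aud = F.normalize(aud_emb, dim=1)
    logits = img @ aud.t() / temperature                      # (B, B) similarities
    targets = torch.arange(logits.size(0), device=logits.device)  # diagonal = positives
    # Symmetric cross-entropy: match each image to its audio and vice versa.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```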
“…However, the perception process is usually abstract, making it difficult to manually label quantitative tags. The natural correspondence between sound and vision provides the necessary supervision for audio-visual learning (Hu et al. 2020; Afouras et al. 2021). Therefore, we design a cross-modal self-supervised learning method, which exploits the complementarity and consistency of multi-modal data to generate weight labels of perception.…”
Section: Introduction (mentioning)
confidence: 99%