Cross-modal Embeddings for Video and Audio Retrieval

Surís, Dídac; Duarte, Amanda; Salvador, Amaia; Torres, Jordi; Giró-i-Nieto, Xavier

doi:10.1007/978-3-030-11018-5_62

Cited by 35 publications

(29 citation statements)

References 15 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…3.3), we propose an audio-visual distance learning network (AVDLN) as illustrated in Fig. 3(b); we notice similar networks are studied in concurrent works [38,49]. Our network can measure the distance D θ (V i , A i ) for a given pair of V i and A i .…”

Section: Methods For Cross-modality Localizationmentioning

confidence: 99%

Audio-Visual Event Localization in Unconstrained Videos

Tian

Jing

et al. 2018

Lecture Notes in Computer Science

328

381

View full text Add to dashboard Cite

In this paper, we introduce a novel problem of audio-visual event localization in unconstrained videos. We define an audio-visual event as an event that is both visible and audible in a video segment. We collect an Audio-Visual Event (AVE) dataset to systemically investigate three temporal localization tasks: supervised and weakly-supervised audio-visual event localization, and cross-modality localization. We develop an audio-guided visual attention mechanism to explore audio-visual correlations, propose a dual multimodal residual network (DMRN) to fuse information over the two modalities, and introduce an audio-visual distance learning network to handle the cross-modality localization. Our experiments support the following findings: joint modeling of auditory and visual modalities outperforms independent modeling, the learned attention can capture semantics of sounding objects, temporal alignment is important for audio-visual fusion, the proposed DMRN is effective in fusing audio-visual features, and strong correlations between the two modalities enable cross-modality localization.achieves the best fusion results. For weakly-supervised learning, we formulate it as a Multiple Instance Learning (MIL) [11] task, and modify our network structure via adding a MIL pooling layer to handle the problem. To address the harder cross-modality localization task, we propose an audio-visual distance learning network that measures the relativeness of any given pair of audio and visual content. It projects audio and visual features into subspaces with the same dimension. Contrastive loss [12] is introduced to learn the network.Observing that there is no publicly available dataset directly suitable for our tasks, we collect a large video dataset that consists of 4143 10-second videos with both audio and video tracks for 28 audio-visual events and annotate their temporal boundaries. Videos in our dataset are originated from YouTube, thus they are unconstrained. Our extensive experiments support the following findings: modeling jointly over auditory and visual modalities outperforms modeling independently over them, audio-visual event localization in a noisy condition can still achieve promising results, the audio-guided visual attention can well capture semantic regions covering sounding objects and can even distinguish audio-visual unrelated videos, temporal alignment is important for audio-visual fusion, the proposed dual multimodal residual network is effective in addressing the fusion task, and strong correlations between the two modalities enable crossmodality localization. These findings have paved a way for our community to solve harder, high-level understanding problems in the future, such as video captioning [13] and movieQA [14], where the auditory modality plays an important role in understanding video but lacks effective modeling.Our work makes the following contributions: (1) a family of three audio-visual event localization tasks; (2) an audio-guided visual attention model to adaptively explore the audio-visual corre...

show abstract

Section: Methods For Cross-modality Localizationmentioning

confidence: 99%

Audio-Visual Event Localization in Unconstrained Videos

Tian

Jing

et al. 2018

Lecture Notes in Computer Science

328

381

View full text Add to dashboard Cite

show abstract

“…Unlike other retrieval tasks such as the text-image task [48,49,50] or the sound-text task [51] , the audio-visual retrieval task mainly focuses on subspace learning. Didac et al [52] proposed a new joint embedding model that mapped two modalities into a joint embedding space, and then directly calculated the Euclidean distance between them. The authors leveraged cosine similarity to ensure that the two modalities in the same space were as close as possible while not overlapping.…”

Section: Audio-image Retrievalmentioning

confidence: 99%

Deep Audio-Visual Learning: A Survey

Zhu

Luo

Wang

et al. 2020

Preprint

View full text Add to dashboard Cite

Audio-visual learning, aimed at exploiting the relationship between audio and visual modalities, has drawn considerable attention since deep learning started to be used successfully. Researchers tend to leverage these two modalities either to improve the performance of previously considered single-modality tasks or to address new challenging problems. In this paper, we provide a comprehensive survey of recent audio-visual learning development. We divide the current audio-visual learning tasks into four different subfields: audio-visual separation and localization, audio-visual correspondence learning, audio-visual generation, and audio-visual representation learning. State-of-the-art methods as well as the remaining challenges of each subfield are further discussed. Finally, we summarize the commonly used datasets and performance metrics.

show abstract

“…Synchronization between the visual and audio or text as a source for self-supervised learning has been studied before [6,13]. In [6], the authors suggest a method to learn the correspondence between audio captions and images, for the task of image retrieval.…”

Section: Related Workmentioning

confidence: 99%

Learning to Detect and Retrieve Objects From Unlabeled Videos

Amrani¹,

Ben-Ari²,

Hakim³

et al. 2019

2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW)

View full text Add to dashboard Cite

Learning an object detector or retrieval requires a large data set with manual annotations. Such data sets are expensive and time consuming to create and therefore difficult to obtain on a large scale. In this work, we propose to exploit the natural correlation in narrations and the visual presence of objects in video, to learn an object detector and retrieval without any manual labeling involved. We pose the problem as weakly supervised learning with noisy labels, and propose a novel object detection paradigm under these constraints. We handle the background rejection by using contrastive samples and confront the high level of label noise with a new clustering score. Our evaluation is based on a set of 11 manually annotated objects in over 5000 frames. We show comparison to a weakly-supervised approach as baseline and provide a strongly labeled upper bound.

show abstract

Cross-modal Embeddings for Video and Audio Retrieval

Cited by 35 publications

References 15 publications

Audio-Visual Event Localization in Unconstrained Videos

Audio-Visual Event Localization in Unconstrained Videos

Deep Audio-Visual Learning: A Survey

Learning to Detect and Retrieve Objects From Unlabeled Videos

Contact Info

Product

Resources

About