2019
DOI: 10.1007/978-3-030-11018-5_62
Cross-modal Embeddings for Video and Audio Retrieval

Abstract: The increasing number of online videos brings several opportunities for training self-supervised neural networks. The creation of large-scale video datasets such as YouTube-8M allows us to deal with this large amount of data in a manageable way. In this work, we find new ways of exploiting this dataset by taking advantage of the multi-modal information it provides. By means of a neural network, we are able to create links between audio and visual documents, by projecting them into a common region of the …

Cited by 35 publications (29 citation statements)
References 15 publications
“…3.3), we propose an audio-visual distance learning network (AVDLN) as illustrated in Fig. 3(b); we notice similar networks are studied in concurrent works [38,49]. Our network can measure the distance D θ (V i , A i ) for a given pair of V i and A i .…”
Section: Methods For Cross-modality Localization
confidence: 99%
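The distance measure D θ (V i , A i ) mentioned in the statement above can be sketched as a pair of learned projections into a shared space followed by a Euclidean distance. The dimensions and the random weights below are illustrative stand-ins, not values from the cited paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical feature sizes: 512-d visual, 128-d audio, 64-d shared space.
D_V, D_A, D_EMB = 512, 128, 64

# Random projection matrices standing in for the learned parameters theta.
W_v = rng.standard_normal((D_V, D_EMB)) / np.sqrt(D_V)
W_a = rng.standard_normal((D_A, D_EMB)) / np.sqrt(D_A)

def distance(v, a):
    """D_theta(V_i, A_i): Euclidean distance between the projected features."""
    e_v = v @ W_v  # project visual features into the shared space
    e_a = a @ W_a  # project audio features into the shared space
    return float(np.linalg.norm(e_v - e_a))

v_i = rng.standard_normal(D_V)
a_i = rng.standard_normal(D_A)
print(distance(v_i, a_i))  # a non-negative scalar distance
```

In training, such a network would be fit so that matched pairs (V i , A i ) yield small distances and mismatched pairs large ones; here the weights are untrained, so only the shape of the computation is shown.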
“…Unlike other retrieval tasks such as the text-image task [48,49,50] or the sound-text task [51] , the audio-visual retrieval task mainly focuses on subspace learning. Didac et al [52] proposed a new joint embedding model that mapped two modalities into a joint embedding space, and then directly calculated the Euclidean distance between them. The authors leveraged cosine similarity to ensure that the two modalities in the same space were as close as possible while not overlapping.…”
Section: Audio-image Retrieval
confidence: 99%
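The retrieval scheme the statement above describes — map both modalities into a joint embedding space, then rank by Euclidean distance — can be illustrated with simulated embeddings. Everything below (dimensions, noise level, data) is hypothetical; in the actual approach the embeddings come from learned projection networks:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy joint-space embeddings: 5 videos and their paired audio tracks.
# Paired audio embeddings are placed near their videos, as training
# with a joint embedding objective would encourage.
EMB = 32
video_embs = rng.standard_normal((5, EMB))
audio_embs = video_embs + 0.05 * rng.standard_normal((5, EMB))

def retrieve(query, gallery):
    """Rank gallery items by Euclidean distance to the query embedding."""
    dists = np.linalg.norm(gallery - query, axis=1)
    return np.argsort(dists)

# Audio-to-video retrieval: each audio query should rank its own video first.
ranking = retrieve(audio_embs[2], video_embs)
print(ranking[0])  # → 2
```

Cosine similarity could be substituted for the Euclidean distance by L2-normalizing the embeddings first, in which case the two rankings coincide.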
“…Synchronization between the visual and audio or text as a source for self-supervised learning has been studied before [6,13]. In [6], the authors suggest a method to learn the correspondence between audio captions and images, for the task of image retrieval.…”
Section: Related Work
confidence: 99%