2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr.2018.00879

Seeing Voices and Hearing Faces: Cross-Modal Biometric Matching

Abstract: We introduce a seemingly impossible task: given only an audio clip of someone speaking, decide which of two face images is the speaker. In this paper we study this, and a number of related cross-modal tasks, aimed at answering the question: how much can we infer from the voice about the face and vice versa? We study this task "in the wild", employing the datasets that are now publicly available for face recognition from static images (VGGFace) and speaker identification from audio (VoxCeleb). These provide trai…
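At inference time, the binary task described in the abstract reduces to scoring one voice clip against two candidate faces. The sketch below shows one plausible shape for such a matcher in PyTorch; the linear encoders, feature dimensions, and fusion by concatenation are illustrative assumptions, not the paper's exact architecture.

```python
# Minimal sketch of the binary "voice -> which of two faces?" task from the
# abstract. The encoder designs, dimensions, and fusion-by-concatenation are
# assumptions for illustration, not the paper's actual network.
import torch
import torch.nn as nn

class CrossModalMatcher(nn.Module):
    def __init__(self, voice_dim=512, face_dim=512, hidden=1024):
        super().__init__()
        # In the real setting these would be deep CNNs over spectrograms and
        # face images (e.g. VGG-style); placeholders keep the sketch short.
        self.voice_net = nn.Sequential(nn.Linear(voice_dim, hidden), nn.ReLU())
        self.face_net = nn.Sequential(nn.Linear(face_dim, hidden), nn.ReLU())
        # Classifier over the concatenated (voice, face A, face B) features,
        # producing a 2-way softmax: which face matches the voice?
        self.classifier = nn.Linear(3 * hidden, 2)

    def forward(self, voice, face_a, face_b):
        v = self.voice_net(voice)
        fa = self.face_net(face_a)     # both faces share one subnetwork
        fb = self.face_net(face_b)
        return self.classifier(torch.cat([v, fa, fb], dim=-1))

model = CrossModalMatcher()
voice = torch.randn(8, 512)            # batch of audio features
face_a, face_b = torch.randn(8, 512), torch.randn(8, 512)
logits = model(voice, face_a, face_b)  # shape (8, 2)
pred = logits.argmax(dim=-1)           # 0 -> face A is the speaker, 1 -> face B
```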

Cited by 187 publications (146 citation statements)
References 53 publications
“…Recently, a dataset tailored towards audio-visual biometrics was introduced [6], [29] to aid the learning of audio and visual information and thus obtain a joint representation. Many works have focused on speaker recognition and matching from audio and visual signals [5], [8]. Although these works effectively capture cross-modal embeddings, they require either separate networks for each modality and/or pair selection during training to effectively penalize the negative pairs.…”
Section: Joint Latent Space Representation (mentioning)
confidence: 99%
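The pair-selection requirement flagged in this excerpt can be made concrete: contrastive training on (voice, face) pairs only works if matched and mismatched identity pairs are explicitly constructed and the mismatches are pushed apart. A minimal sketch, assuming L2-normalized embeddings and an illustrative margin:

```python
# Hedged illustration of pair-based cross-modal training: positives are
# (voice, face) pairs of the same identity, negatives are mismatched pairs.
# Function name, signature, and margin value are illustrative assumptions.
import torch
import torch.nn.functional as F

def contrastive_loss(voice_emb, face_emb, same_identity, margin=1.0):
    """voice_emb, face_emb: (N, D) L2-normalized embeddings.
    same_identity: (N,) float tensor, 1.0 for matched pairs, 0.0 otherwise."""
    dist = F.pairwise_distance(voice_emb, face_emb)
    # Pull matched pairs together; push mismatched pairs beyond the margin.
    pos = same_identity * dist.pow(2)
    neg = (1 - same_identity) * F.relu(margin - dist).pow(2)
    return (pos + neg).mean()
```

Without deliberate selection of hard negatives, most mismatched pairs quickly fall outside the margin and contribute no gradient, which is why such methods depend on the pair-selection step the excerpt criticizes.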
“…However, recently the VoxCeleb dataset [6] has been introduced, which comprises a collection of video and audio recordings of a large number of celebrities. Previous works in the literature [7], [5], [8] have modeled the problem of cross-modal matching by employing separate networks for the multiple modalities, either in a triplet-network fashion or as subnetworks. Separate networks in triplet fashion may help with modularity given few modalities (two in this case) at the input, but it is important to take into account the possibility of multiple input streams (text, image, voice, etc.).…”
Section: Introduction (mentioning)
confidence: 99%
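To make the "separate networks in triplet fashion" setup concrete: each modality gets its own encoder, and a triplet loss pulls an audio anchor toward a face of the same identity and away from a face of a different identity. The encoders, dimensions, and margin below are assumptions for illustration only.

```python
# Sketch of the triplet setup the excerpt describes: an audio anchor with a
# matching and a non-matching face, each modality with its own encoder.
import torch
import torch.nn as nn
import torch.nn.functional as F

voice_encoder = nn.Linear(512, 256)   # stand-in for a deep audio CNN
face_encoder = nn.Linear(512, 256)    # stand-in for a face CNN (shared by both faces)

triplet = nn.TripletMarginLoss(margin=0.2)

voice = torch.randn(8, 512)
face_pos = torch.randn(8, 512)        # same identity as the voice
face_neg = torch.randn(8, 512)        # different identity

anchor = F.normalize(voice_encoder(voice), dim=-1)
pos = F.normalize(face_encoder(face_pos), dim=-1)
neg = F.normalize(face_encoder(face_neg), dim=-1)
loss = triplet(anchor, pos, neg)      # pulls voice toward the matching face
loss.backward()                       # gradients flow into both encoders
```

Note how adding a third modality (e.g. text) would require yet another encoder and a new sampling scheme, which is the scalability concern the excerpt raises.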
“…Beyond the merits of cultivating a better understanding of the operation of cross-modal sensory information integration in vertebrates, there is the possibility that an accurate computational model for this phenomenon could translate into a general algorithm for pattern recognition tasks in computer science. A direct application of this method lies in the development of novel information fusion algorithms that leverage inputs from multiple sensory modalities, i.e., vision and audition [34]. Another practical application is the invention of innovative sensors capable of detecting changes in the environment and then re-configuring on the fly to change operational parameters and power consumption requirements.…”
(mentioning)
confidence: 99%