2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr.2018.00879

Seeing Voices and Hearing Faces: Cross-Modal Biometric Matching

Abstract: We introduce a seemingly impossible task: given only an audio clip of someone speaking, decide which of two face images is the speaker. In this paper we study this, and a number of related cross-modal tasks, aimed at answering the question: how much can we infer from the voice about the face and vice versa? We study this task "in the wild", employing the datasets that are now publicly available for face recognition from static images (VGGFace) and speaker identification from audio (VoxCeleb). These provide trai…
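At inference time, the binary task described in the abstract reduces to scoring one voice clip against two candidate faces. The sketch below shows one plausible shape for such a matcher in PyTorch; the linear encoders, feature dimensions, and fusion by concatenation are illustrative assumptions, not the paper's exact architecture.

```python
# Minimal sketch of the binary "voice -> which of two faces?" task from the
# abstract. The encoder designs, dimensions, and fusion-by-concatenation are
# assumptions for illustration, not the paper's actual network.
import torch
import torch.nn as nn

class CrossModalMatcher(nn.Module):
    def __init__(self, voice_dim=512, face_dim=512, hidden=1024):
        super().__init__()
        # In the real setting these would be deep CNNs over spectrograms and
        # face images (e.g. VGG-style); placeholders keep the sketch short.
        self.voice_net = nn.Sequential(nn.Linear(voice_dim, hidden), nn.ReLU())
        self.face_net = nn.Sequential(nn.Linear(face_dim, hidden), nn.ReLU())
        # Classifier over the concatenated (voice, face A, face B) features,
        # producing a 2-way softmax: which face matches the voice?
        self.classifier = nn.Linear(3 * hidden, 2)

    def forward(self, voice, face_a, face_b):
        v = self.voice_net(voice)
        fa = self.face_net(face_a)     # both faces share one subnetwork
        fb = self.face_net(face_b)
        return self.classifier(torch.cat([v, fa, fb], dim=-1))

model = CrossModalMatcher()
voice = torch.randn(8, 512)            # batch of audio features
face_a, face_b = torch.randn(8, 512), torch.randn(8, 512)
logits = model(voice, face_a, face_b)  # shape (8, 2)
pred = logits.argmax(dim=-1)           # 0 -> face A is the speaker, 1 -> face B
```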

Cited by 187 publications (146 citation statements)
References 53 publications
“…Recently, a dataset tailored towards audio-visual biometrics was introduced [6], [29] to aid the learning of audio and visual information and thus obtain a joint representation. Many works have focused on speaker recognition and matching from audio and visual signals [5], [8]. Although these works effectively capture cross-modal embeddings, they require either separate networks for each modality and/or pair selection during training to effectively penalize the negative pairs.…”
Section: Joint Latent Space Representation (mentioning)
confidence: 99%
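The pair-selection requirement flagged in this excerpt can be made concrete: contrastive training on (voice, face) pairs only works if matched and mismatched identity pairs are explicitly constructed and the mismatches are pushed apart. A minimal sketch, assuming L2-normalized embeddings and an illustrative margin:

```python
# Hedged illustration of pair-based cross-modal training: positives are
# (voice, face) pairs of the same identity, negatives are mismatched pairs.
# Function name, signature, and margin value are illustrative assumptions.
import torch
import torch.nn.functional as F

def contrastive_loss(voice_emb, face_emb, same_identity, margin=1.0):
    """voice_emb, face_emb: (N, D) L2-normalized embeddings.
    same_identity: (N,) float tensor, 1.0 for matched pairs, 0.0 otherwise."""
    dist = F.pairwise_distance(voice_emb, face_emb)
    # Pull matched pairs together; push mismatched pairs beyond the margin.
    pos = same_identity * dist.pow(2)
    neg = (1 - same_identity) * F.relu(margin - dist).pow(2)
    return (pos + neg).mean()
```

Without deliberate selection of hard negatives, most mismatched pairs quickly fall outside the margin and contribute no gradient, which is why such methods depend on the pair-selection step the excerpt criticizes.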
“…However, recently the VoxCeleb dataset [6] has been introduced, which comprises a collection of video and audio recordings of a large number of celebrities. Previous works in the literature [7], [5], [8] have modeled the problem of cross-modal matching by employing separate networks for the multiple modalities, either in a triplet-network fashion or as subnetworks. Separate networks in triplet fashion may help with modularity given few modalities (two in this case) at the input, but it is important to take into account the possibility of multiple input streams (text, image, voice, etc.).…”
Section: Introduction (mentioning)
confidence: 99%
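To make the "separate networks in triplet fashion" setup concrete: each modality gets its own encoder, and a triplet loss pulls an audio anchor toward a face of the same identity and away from a face of a different identity. The encoders, dimensions, and margin below are assumptions for illustration only.

```python
# Sketch of the triplet setup the excerpt describes: an audio anchor with a
# matching and a non-matching face, each modality with its own encoder.
import torch
import torch.nn as nn
import torch.nn.functional as F

voice_encoder = nn.Linear(512, 256)   # stand-in for a deep audio CNN
face_encoder = nn.Linear(512, 256)    # stand-in for a face CNN (shared by both faces)

triplet = nn.TripletMarginLoss(margin=0.2)

voice = torch.randn(8, 512)
face_pos = torch.randn(8, 512)        # same identity as the voice
face_neg = torch.randn(8, 512)        # different identity

anchor = F.normalize(voice_encoder(voice), dim=-1)
pos = F.normalize(face_encoder(face_pos), dim=-1)
neg = F.normalize(face_encoder(face_neg), dim=-1)
loss = triplet(anchor, pos, neg)      # pulls voice toward the matching face
loss.backward()                       # gradients flow into both encoders
```

Note how adding a third modality (e.g. text) would require yet another encoder and a new sampling scheme, which is the scalability concern the excerpt raises.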
“…Beyond the merits of cultivating a better understanding of the operation of cross-modal sensory information integration in vertebrates, there is the possibility that an accurate computational model for this phenomenon could translate into a general algorithm for pattern recognition tasks in computer science. A direct application of this method lies in the development of novel information fusion algorithms that leverage inputs from multiple sensory modalities, i.e., vision and audition [34]. Another practical application is the invention of innovative sensors capable of detecting changes in the environment and then re-configuring on the fly to change operational parameters and power consumption requirements.…”
(mentioning)
confidence: 99%