Introducing crossmodal biometrics: Person identification from distinct audio & visual streams

Marcel, Sébastien

doi:10.1109/btas.2010.5634477

Cited by 4 publications

(3 citation statements)

References 19 publications

“…Le and Odobez [28] use transfer learning from face embeddings to try and improve speaker diarisation results. The only attempt we can find to solve a similar task to the one proposed here (but only for videos, and not still face images) is by [38]. This work seeks to map a statistical model of the features in one modality to a statistical model of the features in another modality.…”

Section: Related Work (mentioning)

confidence: 99%

Seeing Voices and Hearing Faces: Cross-Modal Biometric Matching

Nagrani

Albanie

Zisserman

2018

2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition

We introduce a seemingly impossible task: given only an audio clip of someone speaking, decide which of two face images is the speaker. In this paper we study this, and a number of related cross-modal tasks, aimed at answering the question: how much can we infer from the voice about the face and vice versa? We study this task "in the wild", employing the datasets that are now publicly available for face recognition from static images (VGGFace) and speaker identification from audio (VoxCeleb). These provide training and testing scenarios for both static and dynamic testing of cross-modal matching. We make the following contributions: (i) we introduce CNN architectures for both binary and multi-way cross-modal face and audio matching; (ii) we compare dynamic testing (where video information is available, but the audio is not from the same video) with static testing (where only a single still image is available); and (iii) we use human testing as a baseline to calibrate the difficulty of the task. We show that a CNN can indeed be trained to solve this task in both the static and dynamic scenarios, and is even well above chance on 10-way classification of the face given the voice. The CNN matches human performance on easy examples (e.g. different gender across faces) but exceeds human performance on more challenging examples (e.g. faces with the same gender, age and nationality).

“…Their robot uses a combination of clothes, height, and face recognition to identify enrolled individuals and follow them through an environment filled with unknown people. Other, more preliminary work, by Roy and Marcel [9], explores the reconstruction of missing audio/video recognition models from different perceptual modalities. For example, if a speaker is known only by voice, they could be recognized from lip movements in a video.…”

Section: Related Work (mentioning)

confidence: 99%

Learning speaker recognition models through human-robot interaction

Martinson

Lawson

2011

2011 IEEE International Conference on Robotics and Automation

Person identification is the problem of identifying an individual that a computer system is seeing, hearing, etc. Typically this is accomplished using models of the individual. Over time, however, people change. Unless the models stored by the robot change with them, those models will become less and less reliable over time. This work explores automatic updating of person identification models in the domain of speaker recognition. By fusing together tracking and recognition systems from both visual and auditory perceptual modalities, the robot can robustly identify people during continuous interactions and update its models in real-time, improving rates of speaker classification.

“…retrieving an image for a given text, and vice-versa. In biometrics, this is referred to as cross-modal recognition [25], [26]. Several solutions developed in multi-view learning lend themselves naturally to the task of cross-modal retrieval.…”

Section: B Related Work Elsewhere (mentioning)

confidence: 99%

Learning Discriminative Factorized Subspaces With Application to Touchscreen Biometrics

Pokhriyal

Govindaraju

2020

IEEE Access

Information fusion is a challenging problem in biometrics, where data comes from multiple biometric modalities or multiple feature spaces extracted from the same modality. Learning from heterogeneous data sources, in general, is termed multi-view learning, where view is an encompassing term that refers to different sets of observations having distinct statistical properties. Most of the existing approaches to learning from multiple views assume that the views are either independent or fully dependent. However, in real scenarios, these assumptions are almost never truly satisfied. In this work, we relax these assumptions. We propose a feature fusion method called Discriminative Factorized Subspaces (DFS) that learns a factorized subspace consisting of a single shared subspace (that captures the common information), and view-specific subspaces that capture information specific to each view. DFS jointly learns these subspaces by posing the optimization problem as a constrained Rayleigh Quotient based formulation, whose solution is efficiently obtained using generalized eigenvalue decomposition. Our method does not require lots of data to learn from, and we show how it is apt for domains characterized by limited training data and high intra-class variability. As an application, we tackle the challenging problem of touchscreen biometrics, which is based on the study of user interactions with their touch screens. Through extensive experimentation and thorough evaluation, we demonstrate how DFS learns a better discriminatory boundary and performs better than state-of-the-art methods for touchscreen biometric verification.

INDEX TERMS Touchscreen biometrics, multi-modal biometrics, multi-modal data, feature fusion
