2019
DOI: 10.1007/978-3-030-20873-8_18

On Learning Associations of Faces and Voices

Abstract: In this paper, we study the associations between human faces and voices. Audiovisual integration, specifically the integration of facial and vocal information, is a well-researched area in neuroscience. It has been shown that the overlapping information between the two modalities plays a significant role in perceptual tasks such as speaker identification. Through an online study on a new dataset we created, we confirm previous findings that people can associate unseen faces with corresponding voices and vice versa wi…

Cited by 59 publications (45 citation statements)
References 43 publications
“…The associations between faces and voices have been studied extensively in many scientific disciplines. In the domain of computer vision, different cross-modal matching methods have been proposed: a binary or multi-way classification task [34,33,44]; metric learning [27,21]; and the multi-task classification loss [50]. Cross-modal signals extracted from faces and voices have been used to disambiguate voiced and unvoiced consonants [36,9]; to identify active speakers of a video from non-speakers therein [20,17]; to separate mixed speech signals of multiple speakers [14]; to predict lip motions from speech [36,3]; or to learn the correlation between speech and emotion [2].…”
Section: Related Work
Mentioning confidence: 99%
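
The excerpt above lists several cross-modal matching formulations: binary or multi-way classification, metric learning, and multi-task classification losses. Purely as an illustrative sketch of the metric-learning flavor, and not the implementation of any cited work, the PyTorch snippet below pulls face and voice embeddings of the same identity together with a triplet loss; the encoder architectures, feature dimensions, and the training_step helper are hypothetical placeholders.

```python
# Hypothetical sketch of cross-modal metric learning between faces and voices.
# Encoders, feature dimensions, and the toy batch are illustrative placeholders only.
import torch
import torch.nn as nn

class FaceEncoder(nn.Module):
    def __init__(self, in_dim=512, emb_dim=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(), nn.Linear(256, emb_dim))

    def forward(self, x):
        # L2-normalize so face and voice embeddings live on a common unit sphere
        return nn.functional.normalize(self.net(x), dim=-1)

class VoiceEncoder(nn.Module):
    def __init__(self, in_dim=40, emb_dim=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(), nn.Linear(256, emb_dim))

    def forward(self, x):
        return nn.functional.normalize(self.net(x), dim=-1)

face_enc, voice_enc = FaceEncoder(), VoiceEncoder()
triplet = nn.TripletMarginLoss(margin=0.2)
opt = torch.optim.Adam(list(face_enc.parameters()) + list(voice_enc.parameters()), lr=1e-4)

def training_step(face_feat, voice_pos, voice_neg):
    """Anchor face, positive voice (same identity), negative voice (different identity)."""
    loss = triplet(face_enc(face_feat), voice_enc(voice_pos), voice_enc(voice_neg))
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# Toy batch of random features standing in for real face/voice descriptors.
loss = training_step(torch.randn(8, 512), torch.randn(8, 40), torch.randn(8, 40))
```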
“…We obtain the attributes of the VoxCeleb2 test set by using the state-of-the-art methods of Rude et al. [26] and Feng et al. [27] for the 40 facial appearance attributes (defined in the CelebA dataset [28]) and 3D head orientation, respectively. We focus on the relationship between the behavior of attention weights and attributes, considering the fact that Kim et al. [5] already showed the connections of face/voice representations with certain demographic attributes. [Table of attention statistics for Yaw, Pitch, and Roll omitted.] As a statistical measure, given an attribute A, we measure the expectation of the probability $\mathbb{E}\,P(\alpha_f > \bar{\alpha}_f \mid A=\text{true})$, where $\bar{\alpha}_f$ denotes the global mean of the face attention over all the test data, and likewise for the voice.…”
Section: Analysis of the Attention Layer
Mentioning confidence: 99%
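
As a worked restatement of the statistic quoted above, the sketch below estimates $\mathbb{E}\,P(\alpha_f > \bar{\alpha}_f \mid A=\text{true})$ as the empirical fraction of attribute-positive test samples whose face attention weight exceeds the global mean. The array names and the reading of the expectation as a simple fraction are assumptions made here for illustration, not the cited authors' code.

```python
import numpy as np

def attention_attribute_stat(face_attention, attribute_mask):
    """Empirical estimate of E[P(alpha_f > mean(alpha_f) | A = true)].

    face_attention: (N,) attention weights on the face modality, one per test sample.
    attribute_mask: (N,) boolean array, True where the attribute A holds.
    """
    global_mean = face_attention.mean()            # \bar{\alpha}_f over all test data
    selected = face_attention[attribute_mask]      # samples with A = true
    return float((selected > global_mean).mean())  # fraction exceeding the global mean

# Toy example: 6 samples, attribute true for the first three.
alpha_f = np.array([0.7, 0.6, 0.55, 0.4, 0.3, 0.45])
has_attr = np.array([True, True, True, False, False, False])
print(attention_attribute_stat(alpha_f, has_attr))  # -> 1.0 in this toy case
```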
“…A number of recent works [17,18,19,20,21] have explored the concept of exploiting the correspondence between synchronous audio and visual data in teacher-student style architectures (where the 'teacher' is represented by a pretrained network) [17,19], or two-stream networks where both networks are trained from scratch [18,22]. Additional work has examined cross-modal relationships between faces and voices specifically in order to learn identity [23,24,25] or emotion [7] representations. In contrast to these works, we aim to learn representations of both content and identity with a view to explicitly disentangling separate factors; we compare our approach with theirs in Sec.…”
Section: Related Work
Mentioning confidence: 99%
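
The last excerpt mentions teacher-student architectures in which a pretrained network in one modality supervises a network in the other. The snippet below is a minimal sketch of that idea under assumed feature shapes and module names, not the setup of any of the cited papers: a frozen face 'teacher' provides target embeddings that a voice 'student' regresses with an MSE loss.

```python
# Hypothetical teacher-student sketch: a frozen, pretrained face network supervises
# a voice network so that both map the same speaker to nearby embeddings.
import torch
import torch.nn as nn

teacher = nn.Sequential(nn.Linear(512, 128))   # stand-in for a pretrained face embedder
student = nn.Sequential(nn.Linear(40, 256), nn.ReLU(), nn.Linear(256, 128))

teacher.eval()
for p in teacher.parameters():                 # the teacher stays fixed during training
    p.requires_grad_(False)

opt = torch.optim.Adam(student.parameters(), lr=1e-4)
mse = nn.MSELoss()

def distill_step(face_feat, voice_feat):
    """face_feat and voice_feat come from the same time-synchronized clip."""
    with torch.no_grad():
        target = teacher(face_feat)            # teacher embedding, no gradients
    loss = mse(student(voice_feat), target)    # pull the voice embedding toward it
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

print(distill_step(torch.randn(8, 512), torch.randn(8, 40)))
```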