ICASSP 2021 - IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp39728.2021.9414260

A Multi-View Approach to Audio-Visual Speaker Verification

Abstract: Although speaker verification has conventionally been an audio-only task, some practical applications provide both audio and visual streams of input. In these cases, the visual stream provides complementary information and can often be leveraged in conjunction with the acoustics of speech to improve verification performance. In this study, we explore audio-visual approaches to speaker verification, starting with standard fusion techniques to learn joint audio-visual (AV) embeddings, and then propose a novel ap…
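The abstract mentions standard fusion techniques for learning joint audio-visual embeddings. The sketch below illustrates one common baseline of that kind, embedding-level (concatenation) fusion followed by cosine scoring; the module names, dimensions, and two-layer projection are illustrative assumptions, not the paper's actual architecture.

```python
# Minimal sketch of embedding-level fusion for audio-visual speaker
# verification. All names and dimensions are illustrative assumptions,
# not the architecture from the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionAVEmbedder(nn.Module):
    def __init__(self, audio_dim=512, visual_dim=512, embed_dim=256):
        super().__init__()
        # Project the concatenated unimodal embeddings into a joint space.
        self.fusion = nn.Sequential(
            nn.Linear(audio_dim + visual_dim, embed_dim),
            nn.ReLU(),
            nn.Linear(embed_dim, embed_dim),
        )

    def forward(self, audio_emb, visual_emb):
        joint = torch.cat([audio_emb, visual_emb], dim=-1)
        # L2-normalize so verification trials can be scored by cosine similarity.
        return F.normalize(self.fusion(joint), dim=-1)

# A verification trial: score enrollment vs. test embeddings and threshold.
model = FusionAVEmbedder()
enroll = model(torch.randn(1, 512), torch.randn(1, 512))
test = model(torch.randn(1, 512), torch.randn(1, 512))
score = F.cosine_similarity(enroll, test)  # accept if score > threshold
```

Score-level fusion, which averages per-modality similarity scores instead of concatenating embeddings, is the other standard baseline and works with the same unimodal encoders.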

Cited by 23 publications (9 citation statements) | References 20 publications
“…As is expected, fine-tuning with a large amount of labeled data improves performance. In the audio-visual setting, our best model (0.84%) outperforms [33] (1.8% and 1.4%) with a single model and slightly falls behind its ensembled model (0.7%). Note that in contrast to the prior works [18,33,32], which use the whole face, our model relies only on the lip area of the speaker as visual input and achieves a better trade-off between privacy and performance.…”
Section: Comparison With Prior Work
confidence: 86%
“…In the audio-visual setting, our best model (0.84%) outperforms [33] (1.8% and 1.4%) with a single model and slightly falls behind its ensembled model (0.7%). Note that in contrast to the prior works [18,33,32], which use the whole face, our model relies only on the lip area of the speaker as visual input and achieves a better trade-off between privacy and performance. In addition, we acknowledge the gap between our best model and the current SOTA on VC1 ([25]: 0.38%).…”
Section: Comparison With Prior Work
confidence: 86%
“…These tasks are inherently selection problems in which the best fit of a voice-face pair from the dataset is desired. Another similar task is cross-modal verification [32,48,52], which tells whether input faces and voices belong to the same person; this is simply a classification problem for paired inputs. Our work addresses its root question and explains the success of voice-face matching and verification by verifying correlations between voices and face geometry.…”
Section: Audio-Visual Learning
confidence: 99%
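The snippet above frames cross-modal verification as a classification problem over paired inputs. A minimal sketch of that framing, assuming precomputed face and voice embeddings and a hypothetical two-layer classifier (not the architecture of any cited work):

```python
# Cross-modal verification as binary classification over paired inputs:
# given one face embedding and one voice embedding, predict whether they
# belong to the same person. Names and dimensions are assumptions.
import torch
import torch.nn as nn

class CrossModalVerifier(nn.Module):
    def __init__(self, face_dim=512, voice_dim=512, hidden=256):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(face_dim + voice_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),  # logit for "same identity"
        )

    def forward(self, face_emb, voice_emb):
        pair = torch.cat([face_emb, voice_emb], dim=-1)
        return self.classifier(pair).squeeze(-1)

# Train with binary cross-entropy on matched/mismatched face-voice pairs.
verifier = CrossModalVerifier()
logits = verifier(torch.randn(8, 512), torch.randn(8, 512))
labels = torch.randint(0, 2, (8,)).float()  # 1 = same person, 0 = impostor
loss = nn.functional.binary_cross_entropy_with_logits(logits, labels)
```

This contrasts with the matching/selection tasks the snippet describes, where the model must pick the best-fitting counterpart from a candidate set rather than score a single pair.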