ICASSP 2021 - IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp39728.2021.9414260

A Multi-View Approach to Audio-Visual Speaker Verification

Abstract: Although speaker verification has conventionally been an audio-only task, some practical applications provide both audio and visual streams of input. In these cases, the visual stream provides complementary information and can often be leveraged in conjunction with the acoustics of speech to improve verification performance. In this study, we explore audio-visual approaches to speaker verification, starting with standard fusion techniques to learn joint audio-visual (AV) embeddings, and then propose a novel ap…
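The abstract mentions standard fusion techniques for learning joint audio-visual embeddings. The sketch below illustrates one common baseline of that kind, embedding-level (concatenation) fusion followed by cosine scoring; the module names, dimensions, and two-layer projection are illustrative assumptions, not the paper's actual architecture.

```python
# Minimal sketch of embedding-level fusion for audio-visual speaker
# verification. All names and dimensions are illustrative assumptions,
# not the architecture from the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionAVEmbedder(nn.Module):
    def __init__(self, audio_dim=512, visual_dim=512, embed_dim=256):
        super().__init__()
        # Project the concatenated unimodal embeddings into a joint space.
        self.fusion = nn.Sequential(
            nn.Linear(audio_dim + visual_dim, embed_dim),
            nn.ReLU(),
            nn.Linear(embed_dim, embed_dim),
        )

    def forward(self, audio_emb, visual_emb):
        joint = torch.cat([audio_emb, visual_emb], dim=-1)
        # L2-normalize so verification trials can be scored by cosine similarity.
        return F.normalize(self.fusion(joint), dim=-1)

# A verification trial: score enrollment vs. test embeddings and threshold.
model = FusionAVEmbedder()
enroll = model(torch.randn(1, 512), torch.randn(1, 512))
test = model(torch.randn(1, 512), torch.randn(1, 512))
score = F.cosine_similarity(enroll, test)  # accept if score > threshold
```

Score-level fusion, which averages per-modality similarity scores instead of concatenating embeddings, is the other standard baseline and works with the same unimodal encoders.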

Cited by 23 publications (9 citation statements) | References 20 publications
“…As is expected, fine-tuning with a large amount of labeled data improves performance. In the audio-visual setting, our best model (0.84%) outperforms [33] (1.8% and 1.4%) with a single model and slightly falls behind its ensembled model (0.7%). Note that in contrast to the prior works [18,33,32], which use the whole face, our model relies only on the lip area of the speaker as visual input and achieves a better trade-off between privacy and performance.…”
Section: Comparison With Prior Work
confidence: 86%
“…In the audio-visual setting, our best model (0.84%) outperforms [33] (1.8% and 1.4%) with a single model and slightly falls behind its ensembled model (0.7%). Note that in contrast to the prior works [18,33,32], which use the whole face, our model relies only on the lip area of the speaker as visual input and achieves a better trade-off between privacy and performance. In addition, we acknowledge the gap between our best model and the current SOTA on VC1 ([25]: 0.38%).…”
Section: Comparison With Prior Work
confidence: 86%
“…These tasks are inherently selection problems in which the best fit of a voice-face pair from the dataset is desired. Another similar task is cross-modal verification [32,48,52], which tells whether input faces and voices belong to the same person; this is simply a classification problem for paired inputs. Our work addresses its root question and explains the success of voice-face matching and verification by verifying correlations between voices and face geometry.…”
Section: Audio-Visual Learning
confidence: 99%
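The snippet above frames cross-modal verification as a classification problem over paired inputs. A minimal sketch of that framing, assuming precomputed face and voice embeddings and a hypothetical two-layer classifier (not the architecture of any cited work):

```python
# Cross-modal verification as binary classification over paired inputs:
# given one face embedding and one voice embedding, predict whether they
# belong to the same person. Names and dimensions are assumptions.
import torch
import torch.nn as nn

class CrossModalVerifier(nn.Module):
    def __init__(self, face_dim=512, voice_dim=512, hidden=256):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(face_dim + voice_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),  # logit for "same identity"
        )

    def forward(self, face_emb, voice_emb):
        pair = torch.cat([face_emb, voice_emb], dim=-1)
        return self.classifier(pair).squeeze(-1)

# Train with binary cross-entropy on matched/mismatched face-voice pairs.
verifier = CrossModalVerifier()
logits = verifier(torch.randn(8, 512), torch.randn(8, 512))
labels = torch.randint(0, 2, (8,)).float()  # 1 = same person, 0 = impostor
loss = nn.functional.binary_cross_entropy_with_logits(logits, labels)
```

This contrasts with the matching/selection tasks the snippet describes, where the model must pick the best-fitting counterpart from a candidate set rather than score a single pair.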