Interspeech 2018
DOI: 10.21437/interspeech.2018-1929

VoxCeleb2: Deep Speaker Recognition

Cited by 1,458 publications (963 citation statements)
References 21 publications

“…We train our model on VoxCeleb2 [6], a large-scale audio-visual dataset of interviews obtained from unedited YouTube videos. The dataset consists of over a million utterances for 6,112 identities.…”
Section: Dataset (mentioning)
confidence: 99%
“…The motivation for doing so is simple: unlike earlier datasets such as TIMIT [5] that are carefully balanced for phonetic and dialectal coverage, more modern (and larger) datasets created from uncontrolled speech 'in the wild' are likely to contain a strong correlation between identity and linguistic content. For example, VoxCeleb2 [6] consists of interviews of famous celebrities from a wide variety of professions, whose speech can be closely tied to their occupation: the cricketer Adam Gilchrist says the word 'cricket' 17 times and 'president' 0 times, whereas the politician Nancy Pelosi says the word 'President' 88 times, 'Democrats' 19 times and 'cricket' 0 times. Consequently, a model trained to represent identity may be incentivised to use linguistic content as a discriminative cue.…”
Section: Introduction (mentioning)
confidence: 99%
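The identity-content correlation the excerpt describes can be checked directly from transcripts. A minimal sketch follows, not from the cited paper: the `transcripts` mapping and the example utterances are hypothetical stand-ins for per-speaker transcript data.

```python
from collections import Counter

# Hypothetical per-speaker transcripts (speaker name -> list of utterances).
transcripts = {
    "Adam Gilchrist": ["cricket was a big part of my life", "test cricket is tough"],
    "Nancy Pelosi": ["the President and the Democrats", "I spoke with the President"],
}

def word_counts(utterances):
    """Count case-folded word occurrences across one speaker's utterances."""
    counter = Counter()
    for utt in utterances:
        counter.update(utt.lower().split())
    return counter

# Per-speaker counts of occupation-specific words, mirroring the excerpt's example.
for speaker, utts in transcripts.items():
    counts = word_counts(utts)
    print(speaker, "cricket:", counts["cricket"], "president:", counts["president"])
```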
“…To introduce more speaker variability during training, we additionally use data from VoxCeleb 1 and 2 [22,23]. We have kept all the speakers with more than 6 utterances each, resulting in a subset containing 6,490 speakers.…”
Section: Datasets (mentioning)
confidence: 99%
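The selection criterion in this excerpt is a simple per-speaker utterance-count filter. A minimal sketch, assuming utterance lists keyed by speaker ID; the `utterances_by_speaker` dict and its contents are hypothetical, not from the cited paper's code.

```python
# Hypothetical mapping from speaker ID to that speaker's utterance files.
utterances_by_speaker = {
    "id00012": ["u1.wav", "u2.wav", "u3.wav", "u4.wav", "u5.wav", "u6.wav", "u7.wav"],
    "id00015": ["u1.wav", "u2.wav"],
}

# Keep only speakers contributing more than 6 utterances each,
# as described in the excerpt above.
subset = {spk: utts for spk, utts in utterances_by_speaker.items() if len(utts) > 6}
print(len(subset), "speakers retained")
```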
“…Biometric authentication systems are generally based on either physiological biometrics such as fingerprints [1], face [2], [3], and voice [4], [5], or behavioral biometrics such as touch [6] and gait [7], the latter category generally being used for continuous and implicit authentication of users. These systems are mostly based on machine learning: a binary classifier is trained on the target user's data (positive class) and a subset of data from other users (negative class).…”
Section: Introduction (mentioning)
confidence: 99%
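The one-user-vs-rest training scheme this excerpt describes is straightforward to sketch. Below is a minimal illustration using scikit-learn's logistic regression; the feature arrays are synthetic stand-ins for biometric features, and the 0.5 acceptance threshold is an assumption, not a value from the cited work.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Synthetic stand-ins for biometric feature vectors.
target_feats = rng.normal(0.0, 1.0, size=(200, 16))  # target user (positive class)
other_feats = rng.normal(1.0, 1.0, size=(200, 16))   # subset of other users (negative class)

X = np.vstack([target_feats, other_feats])
y = np.concatenate([np.ones(200), np.zeros(200)])

# Binary classifier: target user vs. everyone else.
clf = LogisticRegression(max_iter=1000).fit(X, y)

# At authentication time, accept the probe if the predicted probability
# of the target-user class exceeds a chosen threshold (0.5 here, assumed).
probe = rng.normal(0.0, 1.0, size=(1, 16))
accept = clf.predict_proba(probe)[0, 1] > 0.5
print("accepted:", accept)
```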