Voxceleb: Large-scale speaker verification in the wild

Nagrani, Arsha; Chung, Joon Son; Xie, Weidi; Zisserman, Andrew

doi:10.1016/j.csl.2019.101027

Cited by 386 publications

(286 citation statements)

References 17 publications

Supporting

Mentioning

284

Contrasting


“…The recording condition and audio quality are less than ideal, but, this corpus is suitable for training speaker encoder networks or generalizing any-to-any speaker mapping network. The VoxCeleb database [306] is further a larger scale speech database consisting of about 2,800 hours of untranscribed speech from over 6,000 speakers. This is an appropriate database for training noise-robust speaker encoder networks.…”

Section: Resources (mentioning)

confidence: 99%

An Overview of Voice Conversion and Its Challenges: From Statistical Modeling to Deep Learning

Şişman, Yamagishi, King, et al. 2021

IEEE/ACM Trans. Audio Speech Lang. Process.

168


“…Stimuli for the speech and non-speech contrast were extracted from large popular datasets for these categories. Speech stimuli were extracted from a human speech-utterance dataset comprising short audio clips of interviews recorded on YouTube (52). Non-speech stimuli were extracted from another large dataset comprising short clips of environmental sounds (53).…”

Section: Stimuli For Synthetic Contrasts (mentioning)

confidence: 99%

Cortical response to naturalistic stimuli is largely predictable with deep neural networks

Khosla, Ngo, Jamison, et al. 2020

Preprint


Naturalistic stimuli, such as movies, activate a substantial portion of the human brain, invoking a response shared across individuals. Encoding models that predict the neural response to a given stimulus can be very useful for studying brain function. However, existing neural encoding models focus on limited aspects of naturalistic stimuli, ignoring the complex and dynamic interactions of modalities in this inherently context-rich paradigm. Using movie watching data from the Human Connectome Project (HCP, N=158) database, we build group-level models of neural activity that incorporate several inductive biases about information processing in the brain, including hierarchical processing, assimilation over longer timescales and multi-sensory auditory-visual interactions. We demonstrate how incorporating this joint information leads to remarkable prediction performance across large areas of the cortex, well beyond the visual and auditory cortices into multi-sensory sites and frontal cortex. Furthermore, we illustrate that encoding models learn high-level concepts that generalize remarkably well to alternate task-bound paradigms. Taken together, our findings underscore the potential of neural encoding models as a powerful tool for studying brain function in ecologically valid conditions.


“…All utterances crucially degraded with different types of noises including background chatter, laughter, overlapping speech and room acoustics. Although there are a lot of variations in recording devices and channels [18].…”

Section: Corpus (mentioning)

confidence: 99%

Denoising x-vectors for Robust Speaker Recognition

Mohammadamini, Matrouf, Noé 2020

The Speaker and Language Recognition Workshop (Odyssey 2020)


Using deep learning methods has led to significant improvement in speaker recognition systems. Introducing xvectors as a speaker modeling method has made these systems more robust. Since, in challenging environments with noise and reverberation, the performance of x-vectors systems degrades significantly, the demand for denoising techniques remains as before. In this paper, for the first time, we try to denoise the xvectors speaker embedding. Our focus is on additive noise. Firstly, we use the i-MAP method which considers that both noise and clean x-vectors have a Gaussian distribution. Then, leveraging denoising autoencoders (DAE) we try to reconstruct the clean x-vector from the corrupted version. After that, we propose two hybrid systems composed of statistical i-MAP and DAE. Finally, we propose a novel DAE architecture, named Deep Stacked DAE, composed of several DAEs where each DAE receives as input the output of its predecessor DAE concatenated with the difference between noisy x-vectors and its predecessor's output. The experiments on Fabiol corpus show that the results given by the hybrid DAE i-MAP method in several cases outperforms the conventional DAE and i-MAP methods. Also, the results for Deep Stacked DAE in most cases is better than the other proposed methods. For utterances longer than 12 seconds we achieved a 51% improvement in terms of EER with Deep Stacked DAE, and for utterances shorter than 2 seconds, Deep Stacked DAE gives 18% improvements compared to the baseline system.
