The Speakers in the Wild (SITW) Speaker Recognition Database

McLaren, Mitchell; Ferrer, Luciana; Castán, Diego; Lawson, Aaron

doi:10.21437/interspeech.2016-1129

Cited by 229 publications

(163 citation statements)

References 6 publications

Supporting · Mentioning: 161 · Contrasting · Unclassified

“…VoxCeleb: The entire dataset involves two parts: VoxCeleb1 and VoxCeleb2. We used SITW [22], a subset of VoxCeleb1 as the evaluation set. The rest of VoxCeleb1 was merged with VoxCeleb2 to form the training set (simply denoted by VoxCeleb).…”

Section: Data (mentioning)

confidence: 99%

CN-Celeb: A Challenging Chinese Speaker Recognition Dataset

Fan, Kang, et al. 2020

ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

115

Recently, researchers set an ambitious goal of conducting speaker recognition in unconstrained conditions where the variations on ambient, channel and emotion could be arbitrary. However, most publicly available datasets are collected under constrained environments, i.e., with little noise and limited channel variation. These datasets tend to deliver over optimistic performance and do not meet the request of research on speaker recognition in unconstrained conditions. In this paper, we present CN-Celeb, a large-scale speaker recognition dataset collected 'in the wild'. This dataset contains more than 130,000 utterances from 1,000 Chinese celebrities, and covers 11 different genres in real world. Experiments conducted with two state-of-the-art speaker recognition approaches (i-vector and x-vector) show that the performance on CN-Celeb is far inferior to the one obtained on VoxCeleb, a widely used speaker recognition dataset. This result demonstrates that in real-life conditions, the performance of existing techniques might be much worse than it was thought. Our database is free for researchers and can be downloaded from http://project.cslt.org.

“…We report our results using metrics Equal Error Rate (EER) in % and DCF (Detection Cost Function) [21] under two testing conditions of SITW corpus: Core-Core and Assist-Multi [2]. We refer to the adaptation system trained with LT data as Adaptation system LT.…”

Section: Results For Mic-tel Adaptation (mentioning)

confidence: 99%

“…Speaker recognition technology has made great progress in the last decade. The x-vector approach [1] is the current state-of-the-art in this field, providing superior performance in NIST SRE, Speakers In The Wild (SITW) [2] and VoxCeleb datasets [3]. x-vectors is a data-hungry approach, i.e., it requires a huge amount of labeled data (∼10k speakers with multiple recordings per speaker) to be properly trained.…”

Section: Introduction (mentioning)

confidence: 99%

“…Also in [9], the authors create target domain data by augmenting real noisy data with data created by artificially adding noise to the source domain. In our previous work [13], we improved the performance of the speaker recognition system trained on telephone corpus on Speakers In The Wild (SITW) [2], a microphone corpus. For that, we used development portion of SITW and the much larger VoxCeleb1 dataset [3] as the target domain data to learn the feature mapping function.…”

Section: Introduction (mentioning)

confidence: 99%

Low-Resource Domain Adaptation for Speaker Recognition Using Cycle-Gans

Nidadavolu, Kataria, Villalba, et al. 2019

2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)

Current speaker recognition technology provides great performance with the x-vector approach. However, performance decreases when the evaluation domain is different from the training domain, an issue usually addressed with domain adaptation approaches. Recently, unsupervised domain adaptation using cycle-consistent Generative Adversarial Networks (CycleGAN) has received a lot of attention. Cycle-GAN learn mappings between features of two domains given non-parallel data. We investigate their effectiveness in low resource scenario i.e. when limited amount of target domain data is available for adaptation, a case unexplored in previous works. We experiment with two adaptation tasks: microphone to telephone and a novel reverberant to clean adaptation with the end goal of improving speaker recognition performance. Number of speakers present in source and target domains are 7000 and 191 respectively. By adding noise to the target domain during CycleGAN training, we were able to achieve better performance compared to the adaptation system whose CycleGAN was trained on a larger target data. On reverberant to clean adaptation task, our models improved EER by 18.3% relative on VOiCES dataset compared to a system trained on clean data. They also slightly improved over the state-of-the-art Weighted Prediction Error (WPE) de-reverberation algorithm.
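Several of the citing statements and abstracts above report Equal Error Rate (EER) and relative EER improvement. As a minimal sketch of how those two quantities can be computed, the Python below sweeps a decision threshold over a set of trial scores and picks the point where the miss rate and false-alarm rate are closest. The function names `equal_error_rate` and `relative_improvement` and all score values are hypothetical and chosen for illustration; they are not taken from SITW or from any of the citing papers, whose evaluation toolkits may interpolate the error curve differently.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate the EER with a simple threshold sweep.

    A trial is accepted when its score is >= the threshold.
    Returns (eer, threshold) at the threshold where the miss rate
    and the false-alarm rate are closest to each other.
    """
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best = None  # (gap, eer, threshold)
    for t in thresholds:
        # Miss rate: fraction of target trials rejected at this threshold.
        miss = sum(s < t for s in target_scores) / len(target_scores)
        # False-alarm rate: fraction of non-target trials accepted.
        false_alarm = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(miss - false_alarm)
        if best is None or gap < best[0]:
            best = (gap, (miss + false_alarm) / 2, t)
    return best[1], best[2]


def relative_improvement(baseline_eer, new_eer):
    """Relative EER reduction, e.g. 0.2 means 20% relative."""
    return (baseline_eer - new_eer) / baseline_eer


# Toy scores (illustrative only, not real evaluation data).
targets = [0.9, 0.8, 0.7, 0.3]
nontargets = [0.6, 0.4, 0.2, 0.1]
eer, threshold = equal_error_rate(targets, nontargets)
print(f"EER = {eer:.2%} at threshold {threshold}")
print(f"Relative improvement = {relative_improvement(5.0, 4.0):.1%}")
```

The sweep is quadratic in the number of trials, which is fine for a toy example; evaluations with many trials would normally sort the scores once and accumulate error counts instead.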

“…SITW-Eval.Core: A standard free database collected by [23] for ASV evaluation. It was collected from open-source media channels, and consists of speech data covering 299 well-known persons.…”

Section: A Data (mentioning)

confidence: 99%

VAE-based Domain Adaptation for Speaker Verification

Wang 2019

2019 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)

Deep speaker embedding has achieved satisfactory performance in speaker verification. By enforcing the neural model to discriminate the speakers in the training set, deep speaker embedding (called 'x-vectors') can be derived from the hidden layers. Despite its good performance, the present embedding model is highly domain sensitive, which means that it often works well in domains whose acoustic condition matches that of the training data (in-domain), but degrades in mismatched domains (out-of-domain). In this paper, we present a domain adaptation approach based on Variational Auto-Encoder (VAE). This model transforms x-vectors to a regularized latent space; within this latent space, a small amount of data from the target domain is sufficient to accomplish the adaptation. Our experiments demonstrated that by this VAE-adaptation approach, speaker embeddings can be easily transformed to the target domain, leading to noticeable performance improvement.
