Interspeech 2018
DOI: 10.21437/interspeech.2018-1454

Voices Obscured in Complex Environmental Settings (VOiCES) Corpus

Abstract: This paper introduces the Voices Obscured In Complex Environmental Settings (VOiCES) corpus, a freely available dataset under Creative Commons BY 4.0. This dataset will promote speech and signal processing research on speech recorded by far-field microphones in noisy room conditions. Publicly available speech corpora are mostly composed of isolated speech recorded with close-range microphones. A typical approach to better represent realistic scenarios is to convolve clean speech with noise and a simulated room response for…
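The augmentation approach the abstract contrasts with VOiCES — convolving clean speech with a room impulse response and mixing in noise at a target SNR — can be sketched as follows. This is a minimal illustration with synthetic signals; the function and variable names are illustrative and not from any VOiCES tooling.

```python
import numpy as np
from scipy.signal import fftconvolve

def augment(clean, rir, noise, snr_db):
    """Convolve clean speech with a room impulse response and add
    noise scaled to a target SNR in dB. All inputs are 1-D float
    arrays at the same sample rate (illustrative sketch)."""
    # Reverberant speech: truncate the convolution to the clean length.
    reverberant = fftconvolve(clean, rir)[: len(clean)]
    # Scale the noise so that 10*log10(P_speech / P_noise) == snr_db.
    p_speech = np.mean(reverberant ** 2)
    p_noise = np.mean(noise[: len(reverberant)] ** 2)
    gain = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return reverberant + gain * noise[: len(reverberant)]

# Toy example: white-noise "speech", a decaying-exponential "room
# response", and white background noise, mixed at 10 dB SNR.
rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)
rir = np.exp(-np.arange(800) / 100.0)
noise = rng.standard_normal(16000)
noisy = augment(clean, rir, noise, snr_db=10)
```

Real pipelines draw the impulse responses from measured rooms or a simulator and the noise from corpora such as MUSAN, but the mixing arithmetic is the same.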

Cited by 87 publications (66 citation statements). References 5 publications.
“…V19-eval and V19-dev: We use the VOiCES data corpus [22] to evaluate the performance of our system with respect to the baselines on a speaker verification task and perform probing tasks to examine the systems. It consists of recordings collected from 4 different rooms with microphones placed at various fixed locations, while a loudspeaker played clean speech samples from the Librispeech [23] dataset.…”
Section: Datasets
confidence: 99%
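The speaker verification task this citation evaluates on reduces to scoring trials: comparing an enrollment embedding against a test embedding and thresholding the score. A common scoring backend is cosine similarity; the sketch below is illustrative and not the cited system's exact backend, and the threshold value is arbitrary.

```python
import numpy as np

def cosine_score(emb_a, emb_b):
    """Cosine similarity between two speaker embeddings:
    dot product of the length-normalized vectors."""
    a = emb_a / np.linalg.norm(emb_a)
    b = emb_b / np.linalg.norm(emb_b)
    return float(np.dot(a, b))

# A trial is accepted as "same speaker" when the score clears a
# threshold (here 0.7, chosen arbitrarily for illustration).
enroll = np.array([0.2, 0.9, 0.1])
test = np.array([0.25, 0.85, 0.05])
same_speaker = cosine_score(enroll, test) > 0.7
```

In practice the threshold is tuned on a development set (e.g. V19-dev) and performance is reported as equal error rate on the evaluation trials.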
“…Vox: Our training data consists of a combination of the development and test splits of VoxCeleb2 [25] and the development split of VoxCeleb1 [26].…”
Section: Datasets
confidence: 99%
“…For each data augmentation, we randomly choose from 2000 room impulse responses generated from Pyroomacoustics [21], and add randomly selected background noise from MUSAN [22] and AudioSet [23]. For the test set, we used the VOiCES far-field dataset [4], which we believe captures the essence of challenging channel conditions. For all speech utterances, we use 40-dimensional log-mel filterbanks, with 3-second sliding-window mean subtraction.…”
Section: Datasets and Augmentation
confidence: 99%
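The "3-second sliding-window mean subtraction" in the quote above normalizes each feature frame by the local mean of its neighborhood, which suppresses slowly varying channel effects. A minimal sketch, assuming a 10 ms frame hop so 3 seconds is roughly 300 frames; the function name and window default are illustrative:

```python
import numpy as np

def sliding_cmn(feats, window=300):
    """Sliding-window cepstral mean normalization: subtract from
    each frame the mean over a window of surrounding frames
    (300 frames ~ 3 s at a 10 ms hop). `feats` is a
    (num_frames, num_bins) log-mel feature matrix."""
    out = np.empty_like(feats)
    half = window // 2
    n = len(feats)
    for t in range(n):
        # Window is clipped at the utterance edges.
        lo, hi = max(0, t - half), min(n, t + half)
        out[t] = feats[t] - feats[lo:hi].mean(axis=0)
    return out
```

On a constant input the output is all zeros, since every local mean equals the frame itself; toolkit implementations differ mainly in edge handling and whether variance is also normalized.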
“…A number of speaker recognition systems based on deep neural network (DNN) embeddings have been reported in the literature [1][2][3]. More recently, SRI developed the VOiCES dataset [4] specifically for far-field speaker recognition, and showed that their DNN embeddings significantly outperformed i-vector systems [5].…”
Section: Introduction
confidence: 99%