2020
DOI: 10.1016/j.csl.2019.101026

State-of-the-art speaker recognition with neural network embeddings in NIST SRE18 and Speakers in the Wild evaluations

Cited by 97 publications (89 citation statements)
References 7 publications
“…The auxiliary network in our DFL formulation is the ResNet-34-LDE network described in [14,15,5]. It is a ResNet-34 residual network with Learnable Dictionary Encoding (LDE) pooling and Angular Softmax loss function.…”
Section: Residual Network (mentioning)
confidence: 99%
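The LDE pooling named in this statement can be illustrated with a short sketch. This is a minimal, hypothetical PyTorch module (the component count, scale parameterization, and initialization are assumptions, not the cited ResNet-34-LDE configuration): frames are softly assigned to learnable dictionary components, and the aggregated per-component residuals are concatenated into a fixed-length embedding.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LDEPooling(nn.Module):
    """Minimal sketch of Learnable Dictionary Encoding (LDE) pooling."""

    def __init__(self, feat_dim: int, num_components: int = 64):
        super().__init__()
        # Learnable dictionary components and per-component assignment scales
        # (sizes are illustrative assumptions, not the cited configuration).
        self.mu = nn.Parameter(torch.randn(num_components, feat_dim) * 0.1)
        self.log_s = nn.Parameter(torch.zeros(num_components))

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, time, feat_dim) frame-level features from the network trunk
        r = h.unsqueeze(2) - self.mu                     # (B, T, C, D) residuals
        dist = (r ** 2).sum(dim=-1)                      # (B, T, C) squared distances
        w = F.softmax(-self.log_s.exp() * dist, dim=-1)  # soft assignment over components
        num = (w.unsqueeze(-1) * r).sum(dim=1)           # (B, C, D) weighted residual sum
        den = w.sum(dim=1).unsqueeze(-1) + 1e-8          # (B, C, 1) assignment mass
        return (num / den).flatten(start_dim=1)          # (B, C * D) utterance embedding
```

The Angular Softmax loss mentioned alongside it would then be applied to a speaker-classification head trained on top of this pooled embedding.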
“…Total parameters for ETDNN and FTDNN are 10M and 17M respectively. A summary of those networks can be found in [5].…”
Section: X-vector Network (mentioning)
confidence: 99%
“…Total parameters for ETDNN and FTDNN are 10M and 17M respectively. More details on the networks and the pipeline can be found in [3,13].…”
Section: X-vector Architectures (mentioning)
confidence: 99%
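Both of these statements refer to the extended (ETDNN) and factorized (FTDNN) x-vector networks summarized in the cited paper. As a rough orientation only, a toy x-vector-style network is sketched below: dilated 1-D convolutions over frames, statistics pooling (mean and standard deviation over time), and a segment-level embedding layer. The widths, kernel sizes, and dilations are illustrative assumptions and do not reproduce the 10M-parameter ETDNN or 17M-parameter FTDNN configurations.

```python
import torch
import torch.nn as nn

class ToyXVectorTDNN(nn.Module):
    """Toy x-vector-style TDNN; layer sizes are illustrative only."""

    def __init__(self, feat_dim: int = 40, embed_dim: int = 512, num_speakers: int = 1000):
        super().__init__()
        # Frame-level layers: dilated 1-D convolutions over the time axis.
        self.frame_layers = nn.Sequential(
            nn.Conv1d(feat_dim, 512, kernel_size=5, dilation=1), nn.ReLU(),
            nn.Conv1d(512, 512, kernel_size=3, dilation=2), nn.ReLU(),
            nn.Conv1d(512, 512, kernel_size=3, dilation=3), nn.ReLU(),
            nn.Conv1d(512, 1500, kernel_size=1), nn.ReLU(),
        )
        self.segment = nn.Linear(2 * 1500, embed_dim)       # after mean+std pooling
        self.classifier = nn.Linear(embed_dim, num_speakers)

    def forward(self, x: torch.Tensor):
        # x: (batch, time, feat_dim) acoustic features
        f = self.frame_layers(x.transpose(1, 2))             # (B, 1500, T')
        stats = torch.cat([f.mean(dim=2), f.std(dim=2)], dim=1)  # statistics pooling
        xvec = self.segment(stats)                            # fixed-length x-vector
        return xvec, self.classifier(torch.relu(xvec))
```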
“…One approach to improve the robustness of SV systems is to train them on data created by artificially adding noise to the original training data or simulating the reverberant speech. This method, known as data augmentation, has proven to be effective in improving the performance of SV systems yielding state-of-the-art (SOTA) results on various tasks [2,3]. However, such simulation strategies do not take into account the amount and type of degradation the test utterances can have.…”
Section: Introduction (mentioning)
confidence: 99%
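The augmentation described in this excerpt can be made concrete with a small example. Below is a minimal sketch of the additive-noise part, assuming raw waveforms as NumPy arrays and a target signal-to-noise ratio in dB; the function name and interface are hypothetical, and reverberation augmentation would instead convolve the speech with a real or simulated room impulse response.

```python
import numpy as np

def add_noise_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Mix a noise recording into clean speech at a target SNR (dB)."""
    # Tile or crop the noise so it covers the whole utterance.
    if len(noise) < len(speech):
        noise = np.tile(noise, int(np.ceil(len(speech) / len(noise))))
    noise = noise[: len(speech)]

    speech_power = np.mean(speech ** 2) + 1e-12
    noise_power = np.mean(noise ** 2) + 1e-12
    # Scale the noise so that 10*log10(speech_power / scaled_noise_power) == snr_db.
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10.0)))
    return speech + scale * noise
```

In a training pipeline, such a function would typically be applied on the fly, with noise segments and SNRs sampled at random so that each copy of a training utterance is degraded differently.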