Task-Specific Optimization of Virtual Channel Linear Prediction-Based Speech Dereverberation Front-End for Far-Field Speaker Verification

Yang, Joon‐Young; Chang, Joon‐Hyuk

doi:10.1109/taslp.2022.3205752

Cited by 1 publication

(4 citation statements)

References 77 publications


“…The achieved results show a major improvement in the speaker recognition domain compared to the current state of the art systems, as seen in Table I. Table I compares the proposed VBxVPE system against the top two systems [19], [20] in the SITW Speech Recognition Challenge 2016 [30] along with four state-of-the-art speaker verification and recognition systems [10], [24], [25], [27]. It can be observed that the VBxVPE speaker verification system demonstrates an improved performance on both the single and multi-speaker settings.…”

Section: Results (mentioning)

confidence: 97%

“…Since the VBxVPE system relies on a PLDA model, pre-trained on a large number of speaker-labeled x-vectors [26], the SITW development set [28] was not required at any stage. The SITW evaluation set is composed of a total of 180 different speakers across 2,883 audio files naturally containing overlapping utterances, noise, reverberation, and compression artifacts, making the dataset challenging from a speaker recognition perspective [28], [29].…”

Section: Related Work (mentioning)

confidence: 99%

“…Recently, systems using the Weighted Prediction Error (WPE) speech dereverberation algorithm for cancelling out reverberation and background noise [27] and generating clean audio signals for extracting speaker embeddings have shown improved performance for speaker verification. The waveform amplitude distribution analysis method was employed to estimate the SNR of the real speech recordings, whereby degraded and noisy audio signals were processed by the Virtual Acoustic Channel Expansion (VACE)-WPE and speaker embeddings were extracted using a pre-trained Resnet-34 Deep Speaker Embedding (DSE) Model employing dereverberation without Task specific Optimization (TSO) (characterized by prefix Drv) [27]. The Drv-VACE-WPE system was able to obtain an EER of 1.46% and minC_Det of 0.143 on the 'core-core' evaluation condition of the SITW corpus [28] which surpassed the existing state of the art results.…”

Section: Related Work (mentioning)

confidence: 99%

“…These systems operated by extracting x-vectors from speech segments, performing LDA and using PLDA classifiers to perform a likelihood ratio test between the enrolled and the test speakers in a verification task. Research employing speech enhancement to cancel out noise, reverberation and normalize distortion from the noisy audio signals have also shown improvement in this domain [27].…”

Section: Introduction (mentioning)

confidence: 99%


Speaker Recognition using Multiple X-Vector Speaker Representations with Two-Stage Clustering and Outlier Detection Refinement

Shrestha, Glackin, Wall, et al. 2022

2022 IEEE Intl Conf on Dependable, Autonomic and Secure Computing, Intl Conf on Pervasive Intelligence and Computing, Intl Conf


This paper presents a novel Variational Bayes x-vector Voice Print Extraction (VBxVPE) system that captures vocal variations using multiple x-vector representations, with two-stage clustering and outlier detection, for robust speaker recognition and verification. The approach surpasses state-of-the-art results on the 'core-core' and 'core-multi' evaluation conditions of the Speakers In the Wild dataset. On 'core-core' it achieves an Equal Error Rate of 1.06%, a Cost of Detection score of 0.052, a minimum Cost of Detection score of 0.010, and a Speaker Identification Accuracy of 95.84%, with Precision, Recall and F1 score values of 0.964, 0.958 and 0.961, respectively. On 'core-multi' it achieves an Equal Error Rate of 1.07%, a Cost of Detection score of 0.066, and a minimum Cost of Detection score of 0.010, with Precision, Recall and F1 score values of 0.967, 0.963 and 0.965, respectively.
