Robust Speaker Recognition Based on Single-Channel and Multi-Channel Speech Enhancement

Taherian, Hassan; Wang, Zhong-Qiu; Chang, Jiangfang; Wang, DeLiang

doi:10.1109/taslp.2020.2986896

Cited by 48 publications

(30 citation statements)

References 46 publications


“…We find improvements with the both proposed model and loss function in terms of lower word error rate (WER) in the LibriCSS dataset [11]. Note that although our focus in this paper is single-channel separation, our approach can be easily extended to multi-channel processing using masking based beamforming [1,12].…”

Section: Introduction (mentioning)

confidence: 90%

Time-Domain Loss Modulation Based on Overlap Ratio for Monaural Conversational Speaker Separation

Taherian

Wang

2021

ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Self Cite


Existing speaker separation methods deliver excellent performance on fully overlapped signal mixtures. To apply these methods in daily conversations that include occasional concurrent speakers, recent studies incorporate both overlapped and non-overlapped segments in the training data. However, such training data can degrade the separation performance due to triviality of non-overlapped segments where the model reflects the input to the output. We propose a new loss function for speaker separation based on permutation invariant training that dynamically reweighs losses using the segment overlap ratio. The new loss function emphasizes overlapped regions while deemphasizing the segments with single speakers. We demonstrate the effectiveness of the proposed loss function on an automatic speech recognition (ASR) task. Experiments on the recently introduced LibriCSS corpus show that our proposed single-channel method produces consistent improvements compared to baseline methods.


“…Later performance investigation was done using delay and sum beam-forming to decrease the word error rate in identification of speech signal. Also auto regression-based gaussian distribution and Laplacian distribution is used for enhancing speech signal are described in [7][8][9][10][11][12]. The proposed Laplacian prior estimators minimize unnecessary noise signals in desired speech signals.…”

Section: Introduction (mentioning)

confidence: 99%

An Efficient Reference Free Adaptive Learning Process for Speech Enhancement Applications

Jyoshna

Rahman

KoteswaraRao

2022

Computers, Materials & Continua


In issues like hearing impairment, speech therapy and hearing aids play a major role in reducing the impairment. Removal of noise signals from speech signals is a key task in hearing aids as well as in speech therapy. During the transmission of speech signals, several noise components contaminate the actual speech components. This paper addresses a new adaptive speech enhancement (ASE) method based on a modified version of singular spectrum analysis (MSSA). The MSSA generates a reference signal for ASE, which frees the ASE from requiring an externally fed reference component. The MSSA adopts three key steps for generating the reference from the contaminated speech only: decomposition, grouping and reconstruction. The generated signal is taken as the reference for variable size adaptive learning algorithms. In this work two categories of adaptive learning algorithms are used: the step variable adaptive learning (SVAL) algorithm and the time variable step size adaptive learning (TVAL) algorithm. Further, a sign regressor function is applied to the adaptive learning algorithms to reduce their computational complexity. The performance of the proposed schemes is measured in terms of signal to noise ratio improvement (SNRI), excess mean square error (EMSE) and misadjustment (MSD). For cockpit noise these measures are found to be 29.2850, -27.6060 and 0.0758 dB respectively during the experiments using the SVAL algorithm. Considering the reduced number of multiplications, the sign regressor version of the SVAL-based ASE method is found to be better than its counterparts.


“…The speaker attention module extracts the target speaker's voice, that is further encoded by the speaker representation module into a discriminative speaker embedding for effective speaker verification. There have been studies on joint optimization between speech enhancement and speaker verification [32]-[34]. Along a similar line of thought, we propose to jointly optimize a speaker attention module and a speaker representation module by simultaneously minimizing a signal reconstruction loss and a speaker identity loss.…”

Section: Introduction (mentioning)

confidence: 99%

Target Speaker Verification With Selective Auditory Attention for Single and Multi-Talker Speech

Rao et al.

2021

IEEE/ACM Trans. Audio Speech Lang. Process.


Speaker verification has been studied mostly under the single-talker condition. It is adversely affected in the presence of interference speakers. Inspired by the study on target speaker extraction, e.g., SpEx, we propose a unified speaker verification framework for both single- and multi-talker speech, that is able to pay selective auditory attention to the target speaker. This target speaker verification (tSV) framework jointly optimizes a speaker attention module and a speaker representation module via multitask learning. We study four different target speaker embedding schemes under the tSV framework. The experimental results show that all four target speaker embedding schemes significantly outperform other competitive solutions for multi-talker speech. Notably, the best tSV speaker embedding scheme achieves 76.0% and 55.3% relative improvements over the baseline system on the WSJ0-2mix-extr and Libri2Mix corpora in terms of equal-error-rate for 2-talker speech, while the performance of tSV for single-talker speech is on par with that of traditional speaker verification system, that is trained and evaluated under the same single-talker condition.
