Fast i-vector denoising using MAP estimation and a noise distributions database for robust speaker recognition

Kheder, Waad Ben; Matrouf, Driss; Bousquet, Pierre-Michel; Bonastre, J.-F.; Ajili, Moez

doi:10.1016/j.csl.2016.12.007

Cited by 18 publications

(13 citation statements)

References 20 publications


“…In the previous generation of speaker recognition systems (i.e., i-vector) the mapping from distorted to clean vectors is explored broadly. i-MAP [14] and joint i-MAP [15] are two statistical techniques used for noise compensation in the i-vector framework.…”

Section: Related Work (mentioning)

confidence: 99%

Compensate multiple distortions for speaker recognition systems

Mohammadamini, Matrouf, Bonastre, et al. 2021

2021 29th European Signal Processing Conference (EUSIPCO)

Self Cite


The performance of speaker recognition systems degrades dramatically in severe conditions, in the presence of additive noise and/or reverberation. In some cases there is only one kind of domain mismatch, such as additive noise or reverberation, but in many cases there is more than one distortion. Finding a solution for domain adaptation in the presence of different distortions is a challenge. In this paper we investigate the situation in which none, one, or more of the following distortions are present: early reverberation, full reverberation, and additive noise. We propose two configurations to compensate for these distortions. In the first, a specific denoising autoencoder is used for each distortion. In the second, a single denoising autoencoder is used to compensate for all of these distortions simultaneously. Our experiments show that, when noise and reverberation co-exist, the second configuration gives better results. For example, with the second configuration we obtained a 76.6% relative improvement in EER for utterances longer than 12 seconds. In situations where only one distortion is present, the second configuration gives almost the same results as a specific model for each distortion.


“…I-MAP is a statistical denoising method that is applied in the i-vector space. The main advantage of this method is that it uses both information about the relation between noisy and clean speech and the clean speech distribution [5]. A nonparametric algorithm without considering the relation between corrupted and clean i-vector was proposed by [4], that utilizes the joint distribution of corrupted and clean i-vectors to denoise corrupted i-vector with an MMSE estimator.…”

Section: Related Work (mentioning)

confidence: 99%
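The MAP idea described in the statement above can be sketched numerically. The sketch assumes, purely for illustration, an additive model y = x + n in the vector space, with a Gaussian clean-vector prior and a Gaussian noise distribution. Under that assumption the posterior mode of x has the closed form x_hat = (Sx^-1 + Sn^-1)^-1 (Sn^-1 (y - mu_n) + Sx^-1 mu_x). The function name `imap_denoise` and all parameter names are hypothetical and are not taken from the cited papers.

```python
import numpy as np

def imap_denoise(y, mu_x, sigma_x, mu_n, sigma_n):
    """MAP estimate of a clean vector x from y = x + n.

    Assumes (illustrative only) x ~ N(mu_x, sigma_x) and
    n ~ N(mu_n, sigma_n), independent of each other.
    """
    prec_x = np.linalg.inv(sigma_x)   # precision of the clean-vector prior
    prec_n = np.linalg.inv(sigma_n)   # precision of the noise distribution
    # Weighted combination of the noise-corrected observation and the prior mean
    rhs = prec_n @ (y - mu_n) + prec_x @ mu_x
    # Solve (prec_x + prec_n) x_hat = rhs instead of forming another inverse
    return np.linalg.solve(prec_x + prec_n, rhs)

# Toy usage with hypothetical 2-dimensional statistics
y = np.array([2.0, 4.0])
x_hat = imap_denoise(y, np.zeros(2), np.eye(2), np.zeros(2), np.eye(2))
```

With identity covariances and zero means, as in the toy call, the estimate is simply y / 2: the observation and the prior mean are trusted equally. A tighter noise covariance pulls the estimate toward y - mu_n, and a tighter prior pulls it toward mu_x.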

“…Applying denoising techniques at the speaker modeling level has been done successfully in the i-vector space [4,5,6]. In this paper we apply statistical denoising techniques on x-vectors that works effectively in i-vector domain.…”

Section: Introduction (mentioning)

confidence: 99%


Denoising x-vectors for Robust Speaker Recognition

Mohammadamini, Matrouf, Noé 2020

The Speaker and Language Recognition Workshop (Odyssey 2020)

Self Cite


Deep learning methods have led to significant improvements in speaker recognition systems, and the introduction of x-vectors as a speaker modeling method has made these systems more robust. Because the performance of x-vector systems still degrades significantly in challenging environments with noise and reverberation, the demand for denoising techniques remains. In this paper, for the first time, we try to denoise x-vector speaker embeddings. Our focus is on additive noise. First, we use the i-MAP method, which assumes that both noise and clean x-vectors have a Gaussian distribution. Then, using denoising autoencoders (DAE), we try to reconstruct the clean x-vector from the corrupted version. After that, we propose two hybrid systems composed of statistical i-MAP and DAE. Finally, we propose a novel DAE architecture, named Deep Stacked DAE, composed of several DAEs in which each DAE receives as input the output of its predecessor DAE concatenated with the difference between the noisy x-vector and its predecessor's output. Experiments on the Fabiol corpus show that the hybrid DAE i-MAP method in several cases outperforms the conventional DAE and i-MAP methods. The results for Deep Stacked DAE are also in most cases better than those of the other proposed methods. For utterances longer than 12 seconds we achieved a 51% improvement in terms of EER with Deep Stacked DAE, and for utterances shorter than 2 seconds, Deep Stacked DAE gives an 18% improvement compared to the baseline system.
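The wiring of the Deep Stacked DAE described in this abstract can be illustrated with a small forward-pass sketch. It is not the authors' implementation. The weights are random and untrained, the one-hidden-layer DAE, the layer sizes, and the helper names `make_stage`, `dae_forward`, and `deep_stacked_dae` are hypothetical, and the input to the first DAE (the noisy vector alone) is an assumption, since the abstract only specifies the input of the later stages.

```python
import numpy as np

def make_stage(in_dim, hidden_dim, out_dim, rng):
    """Random, untrained parameters for one single-hidden-layer DAE (illustrative)."""
    w1 = rng.normal(scale=0.1, size=(in_dim, hidden_dim))
    w2 = rng.normal(scale=0.1, size=(hidden_dim, out_dim))
    return w1, np.zeros(hidden_dim), w2, np.zeros(out_dim)

def dae_forward(inp, w1, b1, w2, b2):
    """One DAE: tanh hidden layer followed by a linear output layer."""
    return np.tanh(inp @ w1 + b1) @ w2 + b2

def deep_stacked_dae(noisy, stages):
    """Chain DAEs as the abstract describes.

    Assumption: the first stage sees only the noisy vector. Every later
    stage sees [previous output, noisy - previous output] concatenated.
    """
    out = dae_forward(noisy, *stages[0])
    for params in stages[1:]:
        stage_input = np.concatenate([out, noisy - out], axis=-1)
        out = dae_forward(stage_input, *params)
    return out

# Toy usage: 4-dimensional "x-vector", three stacked stages
rng = np.random.default_rng(0)
dim, hidden = 4, 8
stages = [make_stage(dim, hidden, dim, rng)] + [
    make_stage(2 * dim, hidden, dim, rng) for _ in range(2)
]
denoised = deep_stacked_dae(rng.normal(size=dim), stages)
```

Feeding the residual `noisy - out` to each later stage gives it access to what the previous stage removed, so information discarded too aggressively by an earlier DAE can in principle be recovered downstream.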


“…The first methodology is to apply a denoising transformation to speaker embeddings. In [7,8,9], researchers use either statistical back end or neural network back end to transform noisy speaker embeddings into enhanced ones. A problem with this methodology is information loss.…”

Section: Introduction (mentioning)

confidence: 99%

Cam: Context-Aware Masking for Robust Speaker Verification

Zheng, Suo, et al. 2021

ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)


Performance degradation caused by noise has been a long-standing challenge for speaker verification. Previous methods usually involve applying a denoising transformation to speaker embeddings or enhancing input features. Nevertheless, these methods are lossy and inefficient for speaker embedding. In this paper, we propose context-aware masking (CAM), a novel method to extract robust speaker embeddings. CAM enables the speaker embedding network to "focus" on the speaker of interest and "blur" unrelated noise. The masking threshold is dynamically controlled by an auxiliary context embedding that captures speaker and noise characteristics. Moreover, models adopting CAM can be trained in an end-to-end manner without using synthesized noisy-clean speech pairs. Our results show that CAM improves speaker verification performance in the wild by a large margin compared to the baselines.
