Variational Domain Adversarial Learning for Speaker Verification

Tu, Youzhi; Mak, Man‐Wai; Chien, Jen‐Tzung

doi:10.21437/interspeech.2019-2168

Cited by 50 publications

(24 citation statements)

References 22 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…Note that the posterior for the spoof class is 1 − r ψ (x) as there are two classes. Inspired by [38] and [44], we consider two different AC setups. First, following [38], we use the mean µ z as the input to an AC which is a feedforward neural network with a single hidden layer.…”

Section: Conditioning Vaes By Class Labelmentioning

confidence: 99%

See 1 more Smart Citation

Deep generative variational autoencoding for replay spoof detection in automatic speaker verification

Chettri

Kinnunen

Benetos

2020

Computer Speech & Language

View full text Add to dashboard Cite

Automatic speaker verification (ASV) systems are highly vulnerable to presentation attacks, also called spoofing attacks. Replay is among the simplest attacks to mount -yet difficult to detect reliably. The generalization failure of spoofing countermeasures (CMs) has driven the community to study various alternative deep learning CMs. The majority of them are supervised approaches that learn a human-spoof discriminator. In this paper, we advocate a different, deep generative approach that leverages from powerful unsupervised manifold learning in classification. The potential benefits include the possibility to sample new data, and to obtain insights to the latent features of genuine and spoofed speech. To this end, we propose to use variational autoencoders (VAEs) as an alternative backend for replay attack detection, via three alternative models that differ in their class-conditioning. The first one, similar to the use of Gaussian mixture models (GMMs) in spoof detection, is to train independently two VAEs -one for each class. The second one is to train a single conditional model (C-VAE) by injecting a one-hot class label vector to the encoder and decoder networks. Our final proposal integrates an auxiliary classifier to guide the learning of the latent space. Our experimental results using constant-Q cepstral coefficient (CQCC) features on the ASVspoof 2017 and 2019 physical access subtask datasets indicate that the C-VAE offers substantial improvement in comparison to training two separate VAEs for each class. On the 2019 dataset, the C-VAE outperforms the VAE and the baseline GMM by an absolute 9 -10% in both equal error rate (EER) and tandem detection cost function (t-DCF) metrics. Finally, we propose VAE residuals -the absolute difference of the original input and the reconstruction as features for spoofing detection. The proposed frontend approach augmented with a convolutional neural network classifier demonstrated substantial improvement over the VAE backend use case.

show abstract

Section: Conditioning Vaes By Class Labelmentioning

confidence: 99%

“…Inspired by [38] and [44], we consider two different AC setups. First, following [38], we use the mean µ z as the input to an AC which is a feedforward neural network with a single hidden layer. Second, following [44], we augment a deep-CNN as an AC to the output of the decoder network.…”

Section: Conditioning Vaes By Class Labelmentioning

confidence: 99%

Deep generative variational autoencoding for replay spoof detection in automatic speaker verification

Chettri

Kinnunen

Benetos

2020

Computer Speech & Language

View full text Add to dashboard Cite

show abstract

“…A promising research direction in this context is domain adversarial training to make speaker representations robust to recording conditions [12][13][14][15]. However, a majority of these techniques are supervised, i.e., they require labelled nuisance factors, which might not be readily available in many real-world scenarios.…”

Section: Introductionmentioning

confidence: 99%

Robust Speaker Recognition Using Unsupervised Adversarial Invariance

Peri

Pal

Jati

et al. 2020

ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

View full text Add to dashboard Cite

In this paper, we address the problem of speaker recognition in challenging acoustic conditions using a novel method to extract robust speaker-discriminative speech representations. We adopt a recently proposed unsupervised adversarial invariance architecture to train a network that maps speaker embeddings extracted using a pretrained model onto two lower dimensional embedding spaces. The embedding spaces are learnt to disentangle speaker-discriminative information from all other information present in the audio recordings, without supervision about the acoustic conditions. We analyze the robustness of the proposed embeddings to various sources of variability present in the signal for speaker verification and unsupervised clustering tasks on a large-scale speaker recognition corpus. Our analyses show that the proposed system substantially outperforms the baseline in a variety of challenging acoustic scenarios. Furthermore, for the task of speaker diarization on a real-world meeting corpus, our system shows a relative improvement of 36% in the diarization error rate compared to the state-of-the-art baseline.

show abstract

“…Previous work in adversarial learning of speaker representation has encouraged domain invariance by having an adversary classify the dataset or labelled environment to which the generated features belong [4,12]. However, this is a coarse modelling of the domains over which generated features are encouraged to be invariant.…”

Section: Introductionmentioning

confidence: 99%

“…However, this is a coarse modelling of the domains over which generated features are encouraged to be invariant. In the case of dataset adversarial training [12], for instance, intra-dataset variation is not penalized, instead relying on the differences between datasets being enough to encourage meaningful invariance.…”

Section: Introductionmentioning

confidence: 99%

Channel Adversarial Training for Speaker Verification and Diarization

Luu

Bell

Renals

2020

ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

View full text Add to dashboard Cite

Previous work has encouraged domain-invariance in deep speaker embedding by adversarially classifying the dataset or labelled environment to which the generated features belong. We propose a training strategy which aims to produce features that are invariant at the granularity of the recording or channel, a finer grained objective than dataset-or environmentinvariance. By training an adversary to predict whether pairs of same-speaker embeddings belong to the same recording in a Siamese fashion, learned features are discouraged from utilizing channel information that may be speaker discriminative during training. Experiments for verification on Vox-Celeb and diarization and verification on CALLHOME show promising improvements over a strong baseline in addition to outperforming a dataset-adversarial model. The VoxCeleb model in particular performs well, achieving a 4% relative improvement in EER over a Kaldi baseline, while using a similar architecture and less training data.

show abstract

Variational Domain Adversarial Learning for Speaker Verification

Cited by 50 publications

References 22 publications

Deep generative variational autoencoding for replay spoof detection in automatic speaker verification

Deep generative variational autoencoding for replay spoof detection in automatic speaker verification

Robust Speaker Recognition Using Unsupervised Adversarial Invariance

Channel Adversarial Training for Speaker Verification and Diarization

Contact Info

Product

Resources

About