The SJTU Robust Anti-Spoofing System for the ASVspoof 2019 Challenge

Yang, Yexin; Wang, Hongji; Dinkel, Heinrich; Chen, Zhengyang; Wang, Shuai; Ye, Qian; Yu, Kai

doi:10.21437/interspeech.2019-2170

Cited by 42 publications

(21 citation statements)

References 0 publications


“…During testing, we use the CNN output activation (sigmoid activation) as our spoof detection score. Though another recent study also used VAEs for feature extraction [40], our approach is different; the authors of [40] used the latent variable from a pretrained VAE model, while we use the residual of the original and reconstructed inputs. Table 9 summarizes the results.…”

Section: VAE as a Feature Extractor (mentioning)

confidence: 99%

Deep generative variational autoencoding for replay spoof detection in automatic speaker verification

Chettri

Kinnunen

Benetos

2020

Computer Speech & Language


Automatic speaker verification (ASV) systems are highly vulnerable to presentation attacks, also called spoofing attacks. Replay is among the simplest attacks to mount, yet difficult to detect reliably. The generalization failure of spoofing countermeasures (CMs) has driven the community to study various alternative deep learning CMs. The majority of them are supervised approaches that learn a human-spoof discriminator. In this paper, we advocate a different, deep generative approach that leverages powerful unsupervised manifold learning in classification. The potential benefits include the possibility to sample new data, and to obtain insights into the latent features of genuine and spoofed speech. To this end, we propose to use variational autoencoders (VAEs) as an alternative backend for replay attack detection, via three alternative models that differ in their class-conditioning. The first one, similar to the use of Gaussian mixture models (GMMs) in spoof detection, is to train two VAEs independently, one for each class. The second one is to train a single conditional model (C-VAE) by injecting a one-hot class label vector into the encoder and decoder networks. Our final proposal integrates an auxiliary classifier to guide the learning of the latent space. Our experimental results using constant-Q cepstral coefficient (CQCC) features on the ASVspoof 2017 and 2019 physical access subtask datasets indicate that the C-VAE offers substantial improvement in comparison to training two separate VAEs for each class. On the 2019 dataset, the C-VAE outperforms the VAE and the baseline GMM by an absolute 9-10% in both equal error rate (EER) and tandem detection cost function (t-DCF) metrics. Finally, we propose VAE residuals, the absolute difference of the original input and the reconstruction, as features for spoofing detection. The proposed frontend approach augmented with a convolutional neural network classifier demonstrated substantial improvement over the VAE backend use case.


“…In this study, we use long-term CQT based log power spectrum (LPS) as input to the LCNN system similar to that in [28]. The static dimension of LPS is 84, where the number of octaves is 7 and the number of frequency bins in every octaves is 12.…”

Section: Methods (mentioning)

confidence: 99%
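The 84-dimensional figure in the statement above follows from 7 octaves with 12 bins per octave. The following is a minimal Python sketch of that bin layout and of a log power computation, not the citing authors' implementation. The minimum frequency `F_MIN_HZ`, the helper names, and the `eps` floor are assumptions made for illustration; the quoted passage does not give them.

```python
import math

# Hypothetical minimum frequency (about the pitch C1); not stated in the quoted passage.
F_MIN_HZ = 32.70


def cqt_center_frequencies(f_min, n_octaves=7, bins_per_octave=12):
    """Geometrically spaced center frequencies of a constant-Q transform.

    Bin k sits at f_min * 2**(k / bins_per_octave), so the frequency
    doubles every `bins_per_octave` bins.
    """
    n_bins = n_octaves * bins_per_octave
    return [f_min * 2.0 ** (k / bins_per_octave) for k in range(n_bins)]


def log_power_spectrum(magnitudes, eps=1e-10):
    """Log of the squared magnitude per bin; eps avoids log(0)."""
    return [math.log(m * m + eps) for m in magnitudes]


freqs = cqt_center_frequencies(F_MIN_HZ)
print(len(freqs))  # 84 static dimensions: 7 octaves x 12 bins
```

Each frame of CQT magnitudes passed to `log_power_spectrum` would then yield one 84-dimensional feature vector under these assumptions.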

“…Later, the constant-Q cepstral coefficients (CQCC) [14] derived from long-term constant-Q transform (CQT) emerged as a promising front-end that led to proposal of several handcrafted features along that direction [15][16][17][18]. In the recent years, robust deep learning classifiers such as squeeze excitation residual networks [19,20] and end-to-end systems with light convolutional neural networks (LCNN) [21,22] are found to be effective for detection of spoofing attacks.…”

Section: Introduction (mentioning)

confidence: 99%

Data Augmentation with Signal Companding for Detection of Logical Access Attacks

Das

Yang

2021

ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)


The recent advances in voice conversion (VC) and text-to-speech (TTS) make it possible to produce natural-sounding speech that poses a threat to automatic speaker verification (ASV) systems. To this end, research on spoofing countermeasures has gained attention to protect ASV systems from such attacks. While advanced spoofing countermeasures are able to detect known spoofing attacks, they are less effective against unknown attacks. In this work, we propose a novel data augmentation technique using A-law and mu-law based signal companding. We believe that the proposed method has an edge over traditional data augmentation by adding small perturbation or quantization noise. The studies are conducted on the ASVspoof 2019 logical access corpus using a light convolutional neural network based system. We find that the proposed data augmentation technique based on signal companding outperforms the state-of-the-art spoofing countermeasures, showing the ability to handle unknown attacks.
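For readers unfamiliar with companding, the sketch below shows the standard textbook mu-law and A-law compression curves for a sample normalized to [-1, 1]. It only illustrates the companding functions themselves; how the authors apply them to training data is not described in this abstract. The function names and the default constants (mu = 255, A = 87.6, the common telephony values) are assumptions for illustration.

```python
import math


def mu_law_compress(x, mu=255.0):
    """Standard mu-law curve: sign(x) * ln(1 + mu*|x|) / ln(1 + mu)."""
    return math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)


def mu_law_expand(y, mu=255.0):
    """Inverse of mu_law_compress: sign(y) * ((1 + mu)**|y| - 1) / mu."""
    return math.copysign(math.expm1(abs(y) * math.log1p(mu)) / mu, y)


def a_law_compress(x, a=87.6):
    """Standard A-law curve: linear below 1/A, logarithmic above it."""
    ax = abs(x)
    denom = 1.0 + math.log(a)
    if ax < 1.0 / a:
        y = a * ax / denom
    else:
        y = (1.0 + math.log(a * ax)) / denom
    return math.copysign(y, x)


# Small amplitudes are boosted relative to large ones, which is the
# point of companding: 0.01 maps to a much larger compressed value.
sample = 0.01
print(mu_law_compress(sample), a_law_compress(sample))
```

Both curves map [-1, 1] onto [-1, 1] and are odd-symmetric, so a compress-then-expand round trip without quantization recovers the input up to floating-point error.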


“…We adopt the Light CNN architecture as the discriminator, which was the best system in the ASVspoof 2017 Challenge [20]. It also performed well in the ASVspoof 2019 Challenge in both replay and synthetic speech discrimination sub-tasks [21,22]. The detailed model structure is the same as that of our previous work [23].…”

Section: Synthesized Speech Discriminator Setup (mentioning)

confidence: 99%

Towards Data Selection on TTS Data for Children’s Speech Recognition

Wang

Zhou

et al. 2021

ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Self Cite


Although great progress has been made on automatic speech recognition (ASR) systems, children's speech recognition still remains a challenging task. General ASR systems for children's speech suffer from the lack of corpora and mismatch between children's and adults' speech. Efforts have been made to reduce such mismatch by applying normalization methods to generate modified adults' speech for ASR training. However, modified adults' data can reflect the characteristics of children's speech to a very limited extent. In this work, we adopt text-to-speech data augmentation to improve the performance of children's speech recognition system. We find that the children's TTS model generates speech with inconsistent quality due to children's substandard pronunciations of phonemes, and the ASR system suffers when trained with these additional synthesized data. To solve this problem, we propose data selection strategies on the TTS augmented data, and the effectiveness of the synthesized data can be substantially boosted for children's ASR modeling. We show that the speaker embedding similarity based data selection strategy can obtain the best position: relative 14.0% and 14.7% CER reduction for child conversation and child reading test set respectively compared to the baseline model trained on real data.

