2018 IEEE Spoken Language Technology Workshop (SLT)
DOI: 10.1109/slt.2018.8639666

Analysing The Predictions Of a CNN-Based Replay Spoofing Detection System

Cited by 21 publications (29 citation statements) · References 15 publications

“…We operate on power spectrograms instead of alternative time-frequency representations, following the findings in [25]. All our sub-CNNs use the architecture described in [26], an adapted version of the best-performing model [25] in the ASVspoof 2017 challenge. It consists of 9 convolutional layers, 5 max-pooling layers and 2 fully connected (FC) layers.…”
Section: Proposed Methodology
confidence: 99%
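The architecture quoted above (9 convolutional layers, 5 max-pooling layers, 2 FC layers) can be sketched as follows. The channel widths, kernel sizes, and pooling placement are illustrative assumptions, not the exact configuration of [25]/[26]:

```python
# Hypothetical sketch of a 9-conv / 5-max-pool / 2-FC CNN operating on
# power spectrograms; all layer widths are assumptions for illustration.
import torch
import torch.nn as nn

class ReplayCNN(nn.Module):
    def __init__(self, n_classes: int = 2):
        super().__init__()
        def conv(cin, cout):
            return nn.Sequential(nn.Conv2d(cin, cout, 3, padding=1),
                                 nn.BatchNorm2d(cout), nn.ReLU())
        self.features = nn.Sequential(
            conv(1, 16), conv(16, 16), nn.MaxPool2d(2),      # convs 1-2, pool 1
            conv(16, 32), conv(32, 32), nn.MaxPool2d(2),     # convs 3-4, pool 2
            conv(32, 64), conv(64, 64), nn.MaxPool2d(2),     # convs 5-6, pool 3
            conv(64, 128), conv(128, 128), nn.MaxPool2d(2),  # convs 7-8, pool 4
            conv(128, 128), nn.MaxPool2d(2),                 # conv 9, pool 5
            nn.AdaptiveAvgPool2d(1),                         # collapse to 1x1
        )
        self.classifier = nn.Sequential(
            nn.Flatten(), nn.Linear(128, 64), nn.ReLU(),     # FC 1
            nn.Linear(64, n_classes),                        # FC 2
        )

    def forward(self, spec):  # spec: (batch, 1, freq, time) power spectrogram
        return self.classifier(self.features(spec))
```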
“…Following our prior findings [30] on the ASVspoof 2019 PA dataset, we remove zero-valued samples from the start and end of every audio recording in the dataset. Likewise, on the ASVspoof 2017 v2.0 dataset, we remove leading and trailing silence/nonspeech samples following our findings in [26]. For this, we use our publicly released speech endpoint annotations [31].…”
Section: Input Representation and Preprocessing
confidence: 99%
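A minimal sketch of the trimming step described above, assuming a simple magnitude threshold in place of the released speech endpoint annotations [31] (the `thresh` parameter is a hypothetical knob, not from the cited work; `thresh=0` strips exactly the zero-valued samples):

```python
import numpy as np

def trim_edges(x: np.ndarray, thresh: float = 1e-4) -> np.ndarray:
    """Remove leading/trailing samples whose magnitude is <= `thresh`."""
    mask = np.abs(x) > thresh
    if not mask.any():            # entire signal below threshold
        return x[:0]
    first = np.argmax(mask)                   # first sample above threshold
    last = len(x) - np.argmax(mask[::-1])     # one past the last such sample
    return x[first:last]
```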
“…Second, following [44], we attach a deep CNN as an auxiliary classifier (AC) to the output of the decoder network. Here, we use the CNN architecture from [45]. From here on, we refer to these two setups as AC-VAE 1 and AC-VAE 2, respectively.…”
Section: Conditioning VAEs by Class Label
confidence: 99%
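One way to read "attach a deep CNN as an AC to the decoder output" is sketched below. All layer sizes, the MSE reconstruction term, and the small MLP standing in for the deep CNN of [45] are assumptions:

```python
# Toy AC-VAE sketch: a VAE on flattened spectrogram patches with an
# auxiliary classifier (AC) applied to the decoder's reconstruction.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ACVAE(nn.Module):
    def __init__(self, in_dim=1024, z_dim=32, n_classes=2):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU())
        self.mu = nn.Linear(256, z_dim)
        self.logvar = nn.Linear(256, z_dim)
        self.dec = nn.Sequential(nn.Linear(z_dim, 256), nn.ReLU(),
                                 nn.Linear(256, in_dim))
        # Small MLP as a stand-in for the deep-CNN auxiliary classifier.
        self.ac = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(),
                                nn.Linear(128, n_classes))

    def forward(self, x, y):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparam.
        x_hat = self.dec(z)
        recon = F.mse_loss(x_hat, x)
        kld = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        ac_loss = F.cross_entropy(self.ac(x_hat), y)  # AC on decoder output
        return recon + kld + ac_loss
```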
“…We consider both CQCC [66] and log-power spectrogram features. We apply a pre-processing step to the raw audio waveforms to trim silence/noise before and after the utterance in the training, development and test sets, following the recommendations in [45] and [76]. Following [84], we extract log energy plus 19-dimensional static coefficients augmented with deltas and double-deltas, yielding 60-dimensional feature vectors.…”
Section: Features and Input Representation
confidence: 99%
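A sketch of the 60-dimensional feature layout described above (log energy plus 19 static coefficients, augmented with deltas and double-deltas), assuming MFCCs as a stand-in since CQCC extraction is not a standard librosa feature:

```python
import numpy as np
import librosa

def feats_60d(wav: np.ndarray, sr: int = 16000) -> np.ndarray:
    # 19 static cepstral coefficients (MFCCs as a stand-in for CQCCs).
    mfcc = librosa.feature.mfcc(y=wav, sr=sr, n_mfcc=20)   # c0..c19
    static = mfcc[1:20]                                    # drop c0, keep 19
    # Per-frame log energy as the 20th static dimension.
    energy = librosa.feature.rms(y=wav)[0]
    n = min(static.shape[1], energy.shape[0])              # align frame counts
    log_e = np.log(np.maximum(energy[:n], 1e-10))[None, :]
    base = np.vstack([log_e, static[:, :n]])               # 20 x frames
    delta = librosa.feature.delta(base)                    # 20 x frames
    ddelta = librosa.feature.delta(base, order=2)          # 20 x frames
    return np.vstack([base, delta, ddelta])                # 60 x frames
```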