Interspeech 2018
DOI: 10.21437/interspeech.2018-2279
End-To-End Audio Replay Attack Detection Using Deep Convolutional Networks with Attention

Abstract: With automatic speaker verification (ASV) systems becoming increasingly popular, the development of robust countermeasures against spoofing is needed. Replay attacks pose a significant threat to the reliability of ASV systems because of the relative difficulty in detecting replayed speech and the ease with which such attacks can be mounted. In this paper, we propose an end-to-end deep learning framework for audio replay attack detection. Our proposed approach uses a novel visual attention mechanism on time-frequency…
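The abstract is truncated, but the core idea of attention over a time-frequency input can be illustrated. Below is a minimal PyTorch sketch of attention-weighted pooling over CNN feature maps; the layer sizes, the softmax pooling scheme, and the two-class head are assumptions made for illustration, not the authors' architecture.

import torch
import torch.nn as nn

class AttentiveReplayDetector(nn.Module):
    # Hypothetical sketch: a small CNN with a soft attention map over a
    # time-frequency input, in the spirit of the abstract. All sizes are
    # illustrative assumptions.
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
        )
        self.attn = nn.Conv2d(32, 1, 1)      # 1x1 conv -> attention logits
        self.classifier = nn.Linear(32, 2)   # genuine vs. replay

    def forward(self, x):                    # x: (batch, 1, freq, time)
        h = self.features(x)                 # (batch, 32, freq, time)
        a = self.attn(h)                     # (batch, 1, freq, time)
        w = torch.softmax(a.flatten(2), dim=-1).view_as(a)
        pooled = (h * w).sum(dim=(2, 3))     # attention-weighted pooling
        return self.classifier(pooled)

For example, AttentiveReplayDetector()(torch.randn(4, 1, 256, 400)) returns one genuine/replay score pair per utterance in the batch.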

Cited by 71 publications (53 citation statements) · References 18 publications
“…According to the neural network used, there are several types of deep features. For example, a light convolutional neural network was used to learn deep features from the log power spectra of the constant-Q transform (CQT) and the fast Fourier transform in [14,20,21]; a deep Siamese network, formed from two convolutional neural networks with spectrogram inputs, was trained to obtain Siamese embedding features in [31]; and a residual network (ResNet) was used to learn deep features from the group delay gram in [19,32,33].…”
Section: Related Work
confidence: 99%
“…1, the DNN framework accepts a variable-length feature sequence and produces an utterance-level result directly from the output unit. The network structure here is somewhat similar to the one in [19]. However, there are two main differences: (i) in [19], the input feature sequence is first either truncated or padded to a fixed length along the time axis, and then further resized to a 512×256 "image" before being fed into the DNN.…”
Section: Utterance-Level DNN Framework
confidence: 99%
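To make the contrast with [19] concrete, the following is a hedged NumPy/OpenCV sketch of the truncate-or-pad step followed by resizing to a fixed "image". Only the 512×256 size comes from the text; the target frame count and the choice of cv2.resize as the resizing backend are assumptions.

import numpy as np
import cv2  # resizing backend is an assumption; [19] may use another

def to_fixed_image(feat, target_frames=400, out_hw=(512, 256)):
    # Truncate or pad a (freq, time) feature along the time axis to
    # target_frames, then resize to a fixed "image". target_frames is a
    # made-up example value.
    freq_bins, frames = feat.shape
    if frames >= target_frames:
        feat = feat[:, :target_frames]
    else:
        feat = np.pad(feat, ((0, 0), (0, target_frames - frames)))
    h, w = out_hw
    return cv2.resize(feat.astype(np.float32), (w, h))  # dsize is (width, height)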
“…Previous work in [17][18][19] has investigated the efficiency of deep learning approaches compared to the GMM classifier. Feature representations other than CQCC, such as the short-time Fourier transform (STFT) gram [18] and the group delay gram (GD gram) [19], have also been explored and shown superior performance. In addition, Cai et al. have demonstrated that suitable data augmentation (DA) can significantly improve replay detection performance [20].…”
Section: Introduction
confidence: 99%
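Since [20] is only summarized here, the following is a minimal sketch of what waveform-level data augmentation can look like (random gain plus additive noise); it does not reproduce the specific DA recipe of Cai et al., and all parameter values are assumptions.

import numpy as np

def augment(y, rng=None):
    # Minimal waveform-level DA sketch; gain range and noise level are
    # illustrative, not from [20].
    if rng is None:
        rng = np.random.default_rng()
    y = y * rng.uniform(0.8, 1.2)                 # random gain
    y = y + rng.normal(0.0, 0.005, size=y.shape)  # low-level additive noise
    return np.clip(y, -1.0, 1.0)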
“…A high resolution of 2,048 FFT bins was used for all spectrograms in this study. We were inspired by the experiments in Tom et al. [6], where attention-based GD-grams significantly outperformed spectrograms. The GD-grams used in Tom et al. [6] are obtained with 2,048 FFT bins, a higher resolution than that of the spectrograms. We hypothesized that this difference in resolution could also have played a key role in the performance, besides the difference in the features used.…”
Section: Complementary High-Resolution Feature
confidence: 99%
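For readers unfamiliar with GD-grams, this is a hedged NumPy/librosa sketch of a group-delay gram at a 2,048-point FFT, using the standard identity GD = (X_R·Y_R + X_I·Y_I)/|X|², where X is the STFT of x[n] and Y is the STFT of n·x[n]. The hop length is an assumption, and any smoothing or modified-GD details used in [6] are omitted.

import numpy as np
import librosa

def group_delay_gram(y, n_fft=2048, hop_length=160):
    # Group delay via the time-weighted signal: X = STFT(x), Y = STFT(n*x).
    # The 2,048-point FFT mirrors the text; hop_length is an assumption.
    n = np.arange(len(y))
    X = librosa.stft(y, n_fft=n_fft, hop_length=hop_length)
    Y = librosa.stft(y * n, n_fft=n_fft, hop_length=hop_length)
    # Small epsilon guards against division by zero in silent frames
    return (X.real * Y.real + X.imag * Y.imag) / (np.abs(X) ** 2 + 1e-10)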