Interspeech 2018
DOI: 10.21437/interspeech.2018-2279
End-To-End Audio Replay Attack Detection Using Deep Convolutional Networks with Attention

Abstract: With automatic speaker verification (ASV) systems becoming increasingly popular, the development of robust countermeasures against spoofing is needed. Replay attacks pose a significant threat to the reliability of ASV systems because of the relative difficulty in detecting replayed speech and the ease with which such attacks can be mounted. In this paper, we propose an end-to-end deep learning framework for audio replay attack detection. Our proposed approach uses a novel visual attention mechanism on time-frequency…
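The abstract is truncated, but the core idea of attention over a time-frequency input can be illustrated. Below is a minimal PyTorch sketch of attention-weighted pooling over CNN feature maps; the layer sizes, the softmax pooling scheme, and the two-class head are assumptions made for illustration, not the authors' architecture.

import torch
import torch.nn as nn

class AttentiveReplayDetector(nn.Module):
    # Hypothetical sketch: a small CNN with a soft attention map over a
    # time-frequency input, in the spirit of the abstract. All sizes are
    # illustrative assumptions.
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
        )
        self.attn = nn.Conv2d(32, 1, 1)      # 1x1 conv -> attention logits
        self.classifier = nn.Linear(32, 2)   # genuine vs. replay

    def forward(self, x):                    # x: (batch, 1, freq, time)
        h = self.features(x)                 # (batch, 32, freq, time)
        a = self.attn(h)                     # (batch, 1, freq, time)
        w = torch.softmax(a.flatten(2), dim=-1).view_as(a)
        pooled = (h * w).sum(dim=(2, 3))     # attention-weighted pooling
        return self.classifier(pooled)

For example, AttentiveReplayDetector()(torch.randn(4, 1, 256, 400)) returns one genuine/replay score pair per utterance in the batch.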

Cited by 71 publications (53 citation statements) · References 18 publications
“…According to the neural network used, there are several types of deep features. For example, a light convolutional neural network was used to learn deep features from the log power spectra of the constant-Q transform (CQT) and the fast Fourier transform in [14,20,21]; a deep Siamese network, formed from two convolutional neural networks with spectrogram inputs, was trained to obtain Siamese embedding features in [31]; and a residual network (ResNet) was used to learn deep features from the group delay gram in [19,32,33].…”
Section: Related Work
confidence: 99%
“…1, the DNN framework accepts a variable-length feature sequence and produces an utterance-level result directly from the output unit. The network structure here is somewhat similar to the one in [19]. However, there are two main differences: (i) in [19], the input feature sequence is first either truncated or padded to a fixed length along the time axis, and then further resized to a 512×256 "image" before being fed into the DNN.…”
Section: Utterance-Level DNN Framework
confidence: 99%
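To make the contrast with [19] concrete, the following is a hedged NumPy/OpenCV sketch of the truncate-or-pad step followed by resizing to a fixed "image". Only the 512×256 size comes from the text; the target frame count and the choice of cv2.resize as the resizing backend are assumptions.

import numpy as np
import cv2  # resizing backend is an assumption; [19] may use another

def to_fixed_image(feat, target_frames=400, out_hw=(512, 256)):
    # Truncate or pad a (freq, time) feature along the time axis to
    # target_frames, then resize to a fixed "image". target_frames is a
    # made-up example value.
    freq_bins, frames = feat.shape
    if frames >= target_frames:
        feat = feat[:, :target_frames]
    else:
        feat = np.pad(feat, ((0, 0), (0, target_frames - frames)))
    h, w = out_hw
    return cv2.resize(feat.astype(np.float32), (w, h))  # dsize is (width, height)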
“…Previous work in [17][18][19] has investigated the efficiency of deep learning approaches compared to the GMM classifier. Feature representations other than CQCC, such as the short-time Fourier transform (STFT) gram [18] and the group delay gram (GD gram) [19], have also been explored and shown superior performance. In addition, Cai et al. have demonstrated that suitable data augmentation (DA) can significantly improve replay detection performance [20].…”
Section: Introduction
confidence: 99%
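Since [20] is only summarized here, the following is a minimal sketch of what waveform-level data augmentation can look like (random gain plus additive noise); it does not reproduce the specific DA recipe of Cai et al., and all parameter values are assumptions.

import numpy as np

def augment(y, rng=None):
    # Minimal waveform-level DA sketch; gain range and noise level are
    # illustrative, not from [20].
    if rng is None:
        rng = np.random.default_rng()
    y = y * rng.uniform(0.8, 1.2)                 # random gain
    y = y + rng.normal(0.0, 0.005, size=y.shape)  # low-level additive noise
    return np.clip(y, -1.0, 1.0)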
“…A high resolution of 2,048 FFT bins was used for all spectrograms in this study. We were inspired by the experiments in Tom et al. [6], where attention-based GD-grams significantly outperformed spectrograms. The GD-grams used in Tom et al. [6] are obtained with 2,048 FFT bins, a higher resolution than that of the spectrograms. We hypothesized that this difference in resolution could also have played a key role in the performance, besides the difference in the features used.…”
Section: Complementary High-Resolution Feature
confidence: 99%
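For readers unfamiliar with GD-grams, this is a hedged NumPy/librosa sketch of a group-delay gram at a 2,048-point FFT, using the standard identity GD = (X_R·Y_R + X_I·Y_I)/|X|², where X is the STFT of x[n] and Y is the STFT of n·x[n]. The hop length is an assumption, and any smoothing or modified-GD details used in [6] are omitted.

import numpy as np
import librosa

def group_delay_gram(y, n_fft=2048, hop_length=160):
    # Group delay via the time-weighted signal: X = STFT(x), Y = STFT(n*x).
    # The 2,048-point FFT mirrors the text; hop_length is an assumption.
    n = np.arange(len(y))
    X = librosa.stft(y, n_fft=n_fft, hop_length=hop_length)
    Y = librosa.stft(y * n, n_fft=n_fft, hop_length=hop_length)
    # Small epsilon guards against division by zero in silent frames
    return (X.real * Y.real + X.imag * Y.imag) / (np.abs(X) ** 2 + 1e-10)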