Replay Attack Detection with Complementary High-Resolution Information Using End-to-End DNN for the ASVspoof 2019 Challenge

Jung, Jee-weon; Shim, Hye-jin; Heo, Hee-Soo; Yu, Ha-Jin

doi:10.21437/interspeech.2019-1991

Cited by 31 publications

(12 citation statements)

References 25 publications

(32 reference statements)


“…For the ASVspoof 2019 PA scenario, 50 systems were submitted [19]. Many countermeasures used DNNs such as the CNN, light-CNN (LCNN), and residual network (ResNet) as backend systems [20], [21], [22], [23], [24], [25], [26]. For input features, spectrogram and phase information [22], [27], linear frequency cepstral coefficients (LFCC) [18], constant Q cepstral coefficients (CQCC) [17], Mel-frequency cepstral coefficients (MFCC), inverted MFCC (IMFCC) [28], and rectangular filter cepstral coefficients (RFCC) [29] were adopted.…”

Section: ASVspoof 2019 Results (mentioning)

confidence: 99%

“…A lot of countermeasures using DNN have been proposed for ASVspoof 2019 [19]. One of these countermeasures used high-resolution spectrograms as input features, and CNN and gated recurrent unit (GRU) were used as a classifier, and this countermeasure was named CNN-GRU [20]. The DNN architecture of CNN-GRU is composed of convolutional layers, pooling layers, ResNet layers, and a GRU layer.…”

Section: CNN-GRU for RAD Methods (mentioning)

confidence: 99%

“…The authors of Ref. [20] provided the software for a single ResNet system on GitHub. For training the ResNet system, the ASVspoof 2019 database was used.…”

Section: DB1 (mentioning)

confidence: 99%


Replay Attack Detection Based on Spatial and Spectral Features of Stereo Signal

Yaguchi, Shiota, Kiya (2021). Journal of Information Processing

In this paper, we propose a replay attack detection (RAD) method that uses spatial and spectral features of a stereo signal. To distinguish genuine and replayed utterance, we focus on non-speech segments, in which a human does not emit sound, but a loudspeaker for replay attack might emit some recorded noise or its electromagnetic noise. The generalized cross-correlation (GCC) based spatial features capture this difference. To improve the robustness against the variety of recording environments, we combine the spatial features with spectral features. In particular, we fuse the output scores of GCC-based and spectral feature-based methods. In experiments, we confirm the effectiveness of the combination of spatial and spectral features.
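The abstract above names two mechanisms: GCC-based spatial features computed from a stereo signal, and fusion of the scores from the spatial and spectral systems. A minimal Python sketch of both follows. It assumes NumPy is available, assumes PHAT weighting for the GCC (the abstract does not say which weighting is used), and assumes a simple weighted sum for the fusion. The function names `gcc_phat`, `estimate_lag`, and `fuse_scores` are hypothetical and are not taken from the paper.

```python
import numpy as np


def gcc_phat(left, right, max_lag):
    """Generalized cross-correlation with PHAT weighting (assumed weighting).

    Returns correlation values for lags -max_lag..+max_lag (in samples).
    """
    n = len(left) + len(right)            # zero-pad to avoid circular wrap-around
    spec_left = np.fft.rfft(left, n=n)
    spec_right = np.fft.rfft(right, n=n)
    cross = spec_left * np.conj(spec_right)
    cross = cross / (np.abs(cross) + 1e-12)   # PHAT: keep phase, drop magnitude
    cc = np.fft.irfft(cross, n=n)
    # Reorder so index 0 corresponds to lag -max_lag.
    return np.concatenate((cc[-max_lag:], cc[:max_lag + 1]))


def estimate_lag(left, right, max_lag):
    """Lag (in samples) at the GCC peak; positive if `left` is delayed."""
    cc = gcc_phat(left, right, max_lag)
    return int(np.argmax(cc)) - max_lag


def fuse_scores(spatial_score, spectral_score, weight=0.5):
    """Weighted-sum score fusion (an assumed rule, not the paper's)."""
    return weight * spatial_score + (1.0 - weight) * spectral_score


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    source = rng.standard_normal(256)
    delayed = np.concatenate((np.zeros(5), source[:-5]))  # 5-sample delay
    print(estimate_lag(delayed, source, max_lag=16))
    print(fuse_scores(1.0, 3.0, weight=0.25))
```

In a real system the GCC vector would be computed per frame over non-speech segments and passed to a classifier; the lag estimate here only illustrates what the feature encodes.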


“…However, SV systems are known to be vulnerable to various presentation attacks, such as replay attacks, voice conversion, and speech synthesis. These vulnerabilities have inspired research into presentation attack detection (PAD), which classifies given utterances as spoofed or not spoofed [6][7][8], where many DNN-based systems have achieved promising results [9][10][11]. Table 1 demonstrates the vulnerability of conventional SV systems when faced with presentation attacks.…”

Section: Introduction (mentioning)

confidence: 99%

Integrated Replay Spoofing-Aware Text-Independent Speaker Verification

et al. 2020

Self Cite


A number of studies have successfully developed speaker verification or presentation attack detection systems. However, studies integrating the two tasks remain in the preliminary stages. In this paper, we propose two approaches for building an integrated system of speaker verification and presentation attack detection: an end-to-end monolithic approach and a back-end modular approach. The first approach simultaneously trains speaker identification, presentation attack detection, and the integrated system via multi-task learning with a common feature. However, through experiments, we hypothesize that the information required for performing speaker verification and presentation attack detection might differ because speaker verification systems try to remove device-specific information from speaker embeddings, while presentation attack detection systems exploit such information. Therefore, we propose a back-end modular approach using a separate deep neural network (DNN) for speaker verification and presentation attack detection. This approach has three input components: two speaker embeddings (for enrollment and test each) and prediction of presentation attacks. Experiments are conducted using the ASVspoof 2017-v2 dataset, which includes official trials on the integration of speaker verification and presentation attack detection. The proposed back-end approach demonstrates a relative improvement of 21.77% in terms of the equal error rate for integrated trials compared to a conventional speaker verification system.
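The abstract reports its result as a relative improvement in equal error rate (EER). A minimal Python sketch of how those two quantities are commonly computed follows. The threshold sweep is a simple approximation rather than the challenge's official scoring tool, and the scores and EER values in the usage lines are hypothetical illustrations, not figures from the paper.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Approximate EER by sweeping thresholds over the observed scores.

    A trial is accepted when its score is >= the threshold. The EER is
    taken at the threshold where the false acceptance rate (FAR) and
    false rejection rate (FRR) are closest.
    """
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best_gap, eer = float("inf"), 1.0
    for t in thresholds:
        frr = sum(s < t for s in target_scores) / len(target_scores)
        far = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


def relative_improvement(baseline_eer, new_eer):
    """Relative EER reduction in percent: 100 * (baseline - new) / baseline."""
    return 100.0 * (baseline_eer - new_eer) / baseline_eer


if __name__ == "__main__":
    # Hypothetical scores, for illustration only.
    targets = [0.9, 0.8, 0.7, 0.3]
    nontargets = [0.6, 0.4, 0.2, 0.1]
    print(equal_error_rate(targets, nontargets))
    # Hypothetical EERs (in percent), not the paper's numbers.
    print(relative_improvement(10.0, 8.0))
```

Note that a relative improvement is a ratio against the baseline, so it cannot be converted to an absolute EER without knowing the baseline value.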


“…Comparison of proposed system with existing systems (✔ indicates that a particular attack is addressed and ✖ indicates that a particular attack is not addressed) attacks [27]. Jung et al. [46] has trained a Deep Neural Network Model with 7 spectrograms, i-vectors and raw waveforms only for replay attack detection. Table…”

mentioning

confidence: 99%

Static–dynamic features and hybrid deep learning models based spoof detection system for ASV

Mittal, Dua (2021). Complex Intell. Syst.

Detection of spoof is essential for improving the performance of current scenario of Automatic Speaker Verification (ASV) systems. Empowerment to both frontend and backend parts can build the robust ASV systems. First, this paper discusses performance comparison of static and static–dynamic Constant Q Cepstral Coefficients (CQCC) frontend features by using Long Short Term Memory (LSTM) with Time Distributed Wrappers model at the backend. Second, it performs comparative analysis of ASV systems built using three deep learning models LSTM with Time Distributed Wrappers, LSTM and Convolutional Neural Network at backend and using static–dynamic CQCC features at frontend. Third, it discusses implementation of two spoof detection systems for ASV by using same static–dynamic CQCC features at frontend and different combination of deep learning models at backend. Out of these two, the first one is a voting protocol based two-level spoof detection system that uses CNN, LSTM model at first level and LSTM with Time Distributed Wrappers model at second level. The second one is a two-level spoof detection system with user identification and verification protocol, which uses LSTM model for user identification at first level and LSTM with Time Distributed Wrappers for verification at the second level. For implementing the proposed work, a variation in ASVspoof 2019 dataset has been used to introduce all types of spoofing attacks such as Speech Synthesis (SS), Voice Conversion (VC) and replay in single set of dataset. The results show that, at frontend, static–dynamic CQCC features outperform static CQCC features and at the backend, hybrid combination of deep learning models increases accuracy of spoof detection systems.
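The abstract contrasts static features with static–dynamic features. A minimal Python sketch of one common way to build the latter follows, assuming that "dynamic" means delta and delta-delta coefficients computed with the standard regression formula. The abstract does not confirm this definition; NumPy, the window width of 2, and the function names are likewise assumptions, and the input matrix stands in for CQCC frames computed elsewhere.

```python
import numpy as np


def delta(features, width=2):
    """Regression-based delta coefficients over a (frames, coeffs) matrix.

    d[t] = sum_{n=1..width} n * (c[t+n] - c[t-n]) / (2 * sum_{n} n^2),
    with edge frames repeated as padding.
    """
    features = np.asarray(features, dtype=float)
    num_frames = features.shape[0]
    padded = np.pad(features, ((width, width), (0, 0)), mode="edge")
    denom = 2 * sum(n * n for n in range(1, width + 1))
    out = np.zeros_like(features)
    for n in range(1, width + 1):
        ahead = padded[width + n: width + n + num_frames]
        behind = padded[width - n: width - n + num_frames]
        out += n * (ahead - behind)
    return out / denom


def static_dynamic(features, width=2):
    """Stack static, delta, and delta-delta coefficients column-wise."""
    features = np.asarray(features, dtype=float)
    first = delta(features, width)
    second = delta(first, width)
    return np.hstack([features, first, second])


if __name__ == "__main__":
    # Placeholder "cepstra": a linear ramp, so interior deltas equal 1.
    ramp = np.arange(10, dtype=float).reshape(-1, 1)
    print(static_dynamic(ramp).shape)
```

The stacked matrix triples the feature dimension, which is the input-size difference a backend such as an LSTM would see between the static and static–dynamic configurations.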
