RW-Resnet: A Novel Speech Anti-Spoofing Model Using Raw Waveform

Ma, Youxuan; Ren, Zongze; Xu, Shugong

doi:10.21437/interspeech.2021-438

Cited by 20 publications

(3 citation statements)

References 0 publications


“…Fig. 5 shows the detailed performance of LPS in different attacks of the evaluation set.…”

System                                    t-DCF    EER%
… [13]                                    0.1000   5.06
FFT-LCNN [13]                             0.1028   4.53
LFCC-Siamese CNN [15]                     0.0930   3.79
FFT-LCGRNN [7]                            0.0776   3.03
RW-Resnet [19]                            0.0820   2.98
Ling et al. [16]                          0.0510   1.87
FFT-L-SENet [38]                          0.0368   1.14
AASIST [7]                                0.0347   1.13
LPS(F0) (ours)                            0.0358   1.21

(b) Primary systems
System                                    t-DCF    EER%
T05 [28]                                  0.0069   0.22
T45 [13]                                  0.0510   1.84
T60 [3]                                   0.0755   2.64
GMM fusion [26]                           0.0740   2.92
T24 [28]                                  0.0953   3.45
T50 [36]                                  0.1671   3.56
(Imag(L)+Real(H)) + LPS(F0) (ours)        0.0143   0.43

Section: Effectiveness Of F0 Subband (mentioning)

confidence: 99%
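
The excerpt above compares systems by EER%. As a rough illustration of what that metric measures, the sketch below computes an equal error rate from two lists of detection scores. It assumes that a higher score means "more likely bonafide" and it simply scans the observed scores as candidate thresholds; the function name `compute_eer` and the toy scores are hypothetical, and this is not the official ASVspoof scoring code.

```python
def compute_eer(bonafide_scores, spoof_scores):
    """Return (eer, threshold), assuming higher score = more likely bonafide."""
    # Every observed score is tried as a decision threshold.
    thresholds = sorted(set(bonafide_scores) | set(spoof_scores))
    best = None
    for t in thresholds:
        # False rejection rate: bonafide trials scored below the threshold.
        frr = sum(s < t for s in bonafide_scores) / len(bonafide_scores)
        # False acceptance rate: spoof trials scored at or above the threshold.
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        gap = abs(far - frr)
        # Keep the threshold where the two error rates are closest.
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2, t)
    return best[1], best[2]


if __name__ == "__main__":
    # Toy scores, invented for illustration only.
    bonafide = [0.9, 0.8, 0.7, 0.4]
    spoof = [0.6, 0.3, 0.2, 0.1]
    eer, threshold = compute_eer(bonafide, spoof)
    print(f"EER = {100 * eer:.2f}% at threshold {threshold}")
```

On the toy scores, one bonafide trial falls below the chosen threshold and one spoof trial lands at or above it, so both error rates are 25%. Real evaluations use far more trials and usually interpolate between thresholds.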

“…This is because the features of the LFCC-Capsule Fusion System [18], T45 [13], T60 [3] and Ling et al. [16] are based on the magnitude spectrogram, and the features of the FFT-L-SENet [38] system are based on the low-frequency band and the magnitude spectrogram, which loses high-frequency information and phase information. Although the RW-Resnet [19] and RAWNet2 [27] systems are based on the original waveform and lose no speech information, the original waveform is affected by many factors, which makes it difficult to distinguish effectively between real and fake speech. In addition, T05 is obtained from 7 single systems, including 2 ResNet systems, 4 MobileNet systems, and a DenseNet system.…”

Section: Comparison With Other Systems (mentioning)

confidence: 99%


Audio Deepfake Detection Based on a Combination of F0 Information and Real Plus Imaginary Spectrogram Features

Xue, Fan, Zhao et al. 2022

Preprint


Recently, pioneering research has proposed a large number of acoustic features (log power spectrogram, linear frequency cepstral coefficients, constant Q cepstral coefficients, etc.) for audio deepfake detection, obtaining good performance and showing that different subbands contribute differently to audio deepfake detection. However, this work does not explain what specific information each subband carries, and these features also lose information such as phase. In speech synthesis, fundamental frequency (F0) information is used to improve the quality of synthetic speech, yet the F0 of synthetic speech is still too average and differs significantly from that of real speech. F0 is therefore expected to be important information for discriminating between bonafide and fake speech, but it cannot be used directly because of its irregular distribution. Instead, the frequency band containing most of F0 is selected as the input feature. Meanwhile, to make full use of the phase and full-band information, we also propose to use real and imaginary spectrogram features as complementary input features and model
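
To make the two feature ideas in this abstract more concrete, the sketch below takes one frame of audio, computes its spectrum, keeps the log power of the low-frequency bins where F0 is expected to lie, and splits the full-band spectrum into real and imaginary parts. It is only an illustration under stated assumptions: the 400 Hz cutoff, the 16 kHz sample rate, the frame length, and the function names are all hypothetical and are not taken from the paper, and a naive DFT stands in for a proper STFT.

```python
import cmath
import math


def dft_frame(frame):
    """Naive one-sided DFT of one frame (O(N^2), for illustration only)."""
    n = len(frame)
    return [
        sum(x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(frame))
        for k in range(n // 2 + 1)
    ]


def subband_features(frame, sample_rate, f0_cutoff_hz):
    """Return (low-band log power, full-band real part, full-band imaginary part)."""
    spectrum = dft_frame(frame)
    bin_hz = sample_rate / len(frame)  # frequency spacing between bins
    # Number of bins whose centre frequency is at or below the cutoff.
    n_low = int(f0_cutoff_hz // bin_hz) + 1
    eps = 1e-10  # avoids log(0) on empty bins
    lps_low = [math.log(abs(c) ** 2 + eps) for c in spectrum[:n_low]]
    real_part = [c.real for c in spectrum]
    imag_part = [c.imag for c in spectrum]
    return lps_low, real_part, imag_part


if __name__ == "__main__":
    sample_rate = 16000   # assumed sample rate
    frame_len = 320       # 20 ms frame, so bins are 50 Hz apart
    # Synthetic 200 Hz tone standing in for a voiced frame.
    frame = [math.sin(2 * math.pi * 200 * i / sample_rate) for i in range(frame_len)]
    lps_low, real_part, imag_part = subband_features(frame, sample_rate, 400)
    peak_bin = max(range(len(lps_low)), key=lps_low.__getitem__)
    print("low-band bins:", len(lps_low), "peak at", peak_bin * 50, "Hz")
```

The low-band log power discards phase, while the real and imaginary parts together retain it, which is the complementarity the abstract appeals to.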


“…In recent years, more and more researchers have attempted to improve the detection capability of ASV systems by building deep learning models. Many studies [9][10][11] have proposed effective network architectures, such as using raw waveforms as input to obtain better representations. However, the robustness of the model is still limited by the insufficiency of the training data.…”

Section: Introduction (mentioning)

confidence: 99%

Audio Anti-Spoofing Based on Audio Feature Fusion

Zhang, Liu et al. 2023

Algorithms


The rapid development of speech synthesis technology has significantly improved the naturalness and human-likeness of synthetic speech. As the technical barriers to speech synthesis fall rapidly, illegal activities such as fraud and extortion are increasing, posing a significant threat to authentication systems such as automatic speaker verification. In response to constantly evolving synthesis techniques, and to improve the accuracy of detecting synthetic speech, this paper proposes an end-to-end synthetic speech detection model based on audio feature fusion. The model uses a pre-trained wav2vec2 model to extract features from raw waveforms and an audio feature fusion module for back-end classification. The audio feature fusion module aims to improve model accuracy by making full use of the audio features extracted by the front end and fusing information across time frames and feature dimensions. Data augmentation techniques are also used to improve the model's generalization performance. The model is trained on the training and development sets of the logical access (LA) dataset of the ASVspoof 2019 Challenge, an international standard, and is tested on the logical access (LA) and deep-fake (DF) evaluation datasets of the ASVspoof 2021 Challenge. The equal error rates (EER) on ASVspoof 2021 LA and ASVspoof 2021 DF are 1.18% and 2.62%, respectively, achieving the best results on the DF dataset.
