2021
DOI: 10.48550/arxiv.2107.12710
Preprint
End-to-End Spectro-Temporal Graph Attention Networks for Speaker Verification Anti-Spoofing and Speech Deepfake Detection

Abstract: Artefacts that serve to distinguish bona fide speech from spoofed or deepfake speech are known to reside in specific subbands and temporal segments. Various approaches can be used to capture and model such artefacts; however, none works well across a spectrum of diverse spoofing attacks. Reliable detection then often depends upon the fusion of multiple detection systems, each tuned to detect different forms of attack. In this paper we show that better performance can be achieved when the fusion is performed wi…

Cited by 12 publications (17 citation statements)
References 54 publications (49 reference statements)
“…For the ASV models, we use ResNet34 [38], ECAPA-TDNN [35] and MFA-Conformer [39]. For the countermeasure models, we use AASIST [17], AASIST-L, and RawGAT-ST [25], where AASIST-L is a light version of AASIST. The fusion model in Figure 1 is trained with the Adam optimizer with an initial learning rate of 0.0001.…”
Section: Methods
confidence: 99%
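The training detail quoted above (Adam optimizer, initial learning rate 0.0001) can be illustrated with a minimal, framework-free sketch of the Adam update rule. The beta1/beta2/eps values below are the commonly used defaults and are assumed here, not taken from the cited paper:

```python
# Sketch of one Adam update for a single scalar parameter.
# lr = 1e-4 matches the initial learning rate quoted above;
# beta1, beta2, eps are assumed defaults, not from the paper.

def adam_step(param, grad, m, v, t, lr=1e-4, beta1=0.9, beta2=0.999, eps=1e-8):
    """Return updated (param, m, v) after one Adam step at iteration t >= 1."""
    m = beta1 * m + (1 - beta1) * grad          # first-moment (mean) estimate
    v = beta2 * v + (1 - beta2) * grad * grad   # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                # bias-corrected moments
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (v_hat ** 0.5 + eps)
    return param, m, v

# Usage: minimize f(x) = x^2 (gradient 2x) starting from x = 1.0
x, m, v = 1.0, 0.0, 0.0
for t in range(1, 1001):
    x, m, v = adam_step(x, 2 * x, m, v, t)
```

With lr = 1e-4 the effective step size is small, so after 1000 iterations `x` has moved only partway toward the minimum; in practice a framework optimizer (e.g. `torch.optim.Adam`) would be used instead of this hand-rolled loop.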
“…Current solutions leverage end-to-end deep neural networks (DNNs) [12,13], trying to distinguish artifacts and unnatural cues of spoofed speech from bona fide speech. Thanks to a series of challenges and datasets [1][2][3][4], many novel techniques have been introduced that achieve promising CM performance [12][13][14][15][16][17][18][19][20][21][22][23][24][25][26][27][28].…”
Section: Introduction
confidence: 99%
“…We found that a number of studies have investigated various acoustic features to demonstrate their robustness to PA and LA attacks with state-of-the-art back-end neural networks, including ResNet [13][14][15][16][17], Res2Net [18,19], and graph attention networks [20,21]. The representative acoustic features for SSD are log-power (or magnitude) discrete Fourier transform (DFT), constant-Q transform (CQT), and linear frequency cepstral coefficients (LFCC), which are magnitude features in the frequency domain [16,18,[22][23][24].…”
Section: Introduction
confidence: 99%
“…These end-to-end models can also be viewed as a category of methods that replace the aforementioned feature extraction with neural networks with trainable parameters. Several studies utilized a sinc convolution-based network [20] as the front-end of an anti-spoofing model [17,21].…”
Section: Introduction
confidence: 99%
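The sinc convolution-based front-end mentioned in the quote above (SincNet-style) builds its first-layer filters as band-pass sinc kernels whose cutoff frequencies are the learned parameters. A rough sketch of how one such kernel is constructed; the helper name, kernel length, and Hamming window choice are illustrative assumptions, not the cited paper's code:

```python
import math

def sinc_kernel(f1, f2, kernel_len=101, sr=16000):
    """Band-pass FIR kernel h[n] = 2*f2*sinc(2*f2*n) - 2*f1*sinc(2*f1*n),
    Hamming-windowed. f1 < f2 are cutoff frequencies in Hz; in a SincNet-style
    layer these two cutoffs are the only trainable parameters per filter."""
    f1, f2 = f1 / sr, f2 / sr                    # normalize to cycles/sample
    mid = kernel_len // 2
    h = []
    for n in range(kernel_len):
        t = n - mid
        if t == 0:
            val = 2 * (f2 - f1)                  # limit of the sinc difference
        else:
            val = (math.sin(2 * math.pi * f2 * t)
                   - math.sin(2 * math.pi * f1 * t)) / (math.pi * t)
        w = 0.54 - 0.46 * math.cos(2 * math.pi * n / (kernel_len - 1))  # Hamming
        h.append(val * w)
    return h

# Usage: a 20 Hz – 4 kHz band-pass kernel at 16 kHz sampling rate
h = sinc_kernel(20, 4000)
```

Because only the two cutoffs per filter are learned, such a front-end has far fewer parameters than a free-form first convolution layer, which is part of its appeal for anti-spoofing models operating on raw waveforms.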