The ASVspoof dataset is one of the most established datasets for training and benchmarking systems designed to detect spoofed audio and audio deepfakes. However, we observe an uneven distribution of silence length in the dataset's training and test data, which hints at the target label: bona-fide instances tend to have significantly longer leading and trailing silences than spoofed instances. This could be problematic, since a model may learn to base its decision solely, or at least partially, on the length of the silence (similar to the issue with the Pascal VOC 2007 dataset, where all images of horses also contained a specific watermark [1]). In this paper, we explore this phenomenon in depth. We train a number of networks a) on only the length of the leading silence and b) on full audio, with and without leading and trailing silence. Results show that models trained on only the length of the leading silence perform suspiciously well: they achieve up to 85% accuracy and an equal error rate (EER) of 0.15 on the 'eval' split of the data. Conversely, when training strong models on the full audio files, we observe that trimming silence during preprocessing dramatically worsens performance (EER increases from 0.03 to 0.15). This could indicate that previous work may, in part, have learned only to classify targets based on the length of silence. Consequently, it could mean that spoofing detection may not be as advanced as previous high scores have led us to believe. We hope that by sharing these results, the ASV community can further evaluate this phenomenon.
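The silence shortcut described above can be illustrated with a toy classifier that looks at nothing but the leading-silence duration. This is a minimal sketch, not the paper's actual setup: the amplitude threshold and the 0.12 s cutoff are illustrative assumptions.

```python
import numpy as np

def leading_silence_samples(audio, threshold=1e-3):
    """Count samples before the first sample whose magnitude exceeds `threshold`."""
    above = np.flatnonzero(np.abs(audio) > threshold)
    return len(audio) if above.size == 0 else int(above[0])

def silence_only_classifier(audio, sr, cutoff_s=0.12):
    """Label an utterance 'bona-fide' iff its leading silence is at least `cutoff_s` seconds.

    The cutoff is a hypothetical value chosen for this toy example; a real model
    would learn such a decision boundary from the data.
    """
    return "bona-fide" if leading_silence_samples(audio) / sr >= cutoff_s else "spoof"

# Synthetic example: a clip with 0.2 s of leading silence vs. one with 0.02 s.
sr = 16000
tone = np.sin(2 * np.pi * 440 * np.arange(sr) / sr)
long_silence = np.concatenate([np.zeros(int(0.2 * sr)), tone])
short_silence = np.concatenate([np.zeros(int(0.02 * sr)), tone])
```

If such a trivial rule already separates the two classes well, a deep network trained on the raw audio has every incentive to latch onto the same cue.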
Recent research has highlighted a key issue in speech deepfake detection: models trained on one set of deepfakes perform poorly on others. The question arises: is this due to the continuously improving quality of text-to-speech (TTS) models, i.e., are newer deepfakes simply 'harder' to detect? Or is it because deepfakes generated with one model are fundamentally different from those generated with another? We answer this question by decomposing the performance gap between in-domain and out-of-domain test data into 'hardness' and 'difference' components. Experiments performed using ASVspoof databases indicate that the hardness component is practically negligible, with the performance gap being attributed primarily to the difference component. This has direct implications for real-world deepfake detection, highlighting that merely increasing model capacity, the currently dominant research trend, may not effectively address the generalization challenge.
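One plausible way to operationalise this decomposition is to compare three EERs: a model trained and tested in-domain, a model trained and tested on the new domain (how hard the new fakes are intrinsically), and the in-domain model tested on the new domain (the total cross-domain gap). The sketch below follows that reading; the paper's exact protocol may differ, and the numbers in the example are purely illustrative.

```python
def decompose_gap(eer_in_in, eer_out_out, eer_in_out):
    """Split the cross-domain performance gap into 'hardness' and 'difference'.

    eer_in_in:   model trained and tested on the source domain
    eer_out_out: model trained and tested on the target domain
    eer_in_out:  source-domain model tested on the target domain
    """
    total_gap = eer_in_out - eer_in_in
    hardness = eer_out_out - eer_in_in   # are the new fakes intrinsically harder?
    difference = total_gap - hardness    # or merely different from the training data?
    return hardness, difference

# Illustrative (made-up) numbers: a small hardness term, a large difference term.
hardness, difference = decompose_gap(eer_in_in=0.03, eer_out_out=0.04, eer_in_out=0.20)
```

Under these hypothetical numbers almost the entire 0.17 gap is attributable to the difference component, which mirrors the abstract's conclusion.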
Current text-to-speech algorithms produce realistic fakes of human voices, making deepfake detection a much-needed area of research. While researchers have presented various techniques for detecting audio spoofs, it is often unclear exactly why these architectures are successful: preprocessing steps, hyperparameter settings, and the degree of fine-tuning are not consistent across related work. Which factors contribute to success, and which are accidental? In this work, we address this problem: we systematize audio spoofing detection by re-implementing and uniformly evaluating architectures from related work. We identify overarching features for successful audio deepfake detection, such as using cqtspec or logspec features instead of melspec features, which improves performance by 37% EER on average, all other factors held constant. Additionally, we evaluate generalization capabilities: we collect and publish a new dataset consisting of 37.9 hours of found audio recordings of celebrities and politicians, of which 17.2 hours are deepfakes. We find that related work performs poorly on such real-world data (a performance degradation of up to one thousand percent). This may suggest that the community has tailored its solutions too closely to the prevailing ASVspoof benchmark and that deepfakes are much harder to detect outside the lab than previously thought.
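All three abstracts report results as equal error rate (EER): the error rate at the decision threshold where the false-acceptance and false-rejection rates coincide. A minimal sketch of computing it with a plain threshold sweep (no interpolation between thresholds), assuming higher scores mean 'more likely bona-fide':

```python
import numpy as np

def equal_error_rate(scores, labels):
    """EER via threshold sweep. labels: 1 = bona-fide, 0 = spoof."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    best_gap, eer = np.inf, 1.0
    for t in np.unique(scores):
        far = np.mean(scores[labels == 0] >= t)  # false acceptance rate (spoofs accepted)
        frr = np.mean(scores[labels == 1] < t)   # false rejection rate (bona-fide rejected)
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer

# Perfectly separable scores yield an EER of 0.0.
eer = equal_error_rate([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0])
```

Production systems typically interpolate the ROC curve instead of sweeping discrete score values, but the quantity being estimated is the same.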