2021 Edition of the Automatic Speaker Verification and Spoofing Countermeasures Challenge (ASVspoof 2021)
DOI: 10.21437/asvspoof.2021-9

Speech is Silver, Silence is Golden: What do ASVspoof-trained Models Really Learn?

Abstract: The ASVspoof dataset is one of the most established datasets for training and benchmarking systems designed to detect spoofed audio and audio deepfakes. However, we observe an uneven distribution of silence length in the dataset's training and test data, which hints at the target label: bona-fide instances tend to have significantly longer leading and trailing silences than spoofed instances. This could be problematic, since a model may learn to only, or at least partially, base its decision on the length…
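
The observation in the abstract can be checked directly. Below is a minimal sketch, not taken from the paper, that measures the leading and trailing silence of a single utterance so the distributions for bona-fide and spoofed files can be compared; it assumes librosa's energy-based trimming and a hypothetical 30 dB threshold.

import librosa

def silence_durations(path, top_db=30.0):
    """Return (leading, trailing) silence durations in seconds for one file."""
    y, sr = librosa.load(path, sr=None)  # keep the file's native sample rate
    # librosa.effects.trim returns the trimmed signal and the [start, end]
    # sample indices of the retained (non-silent) region.
    _, (start, end) = librosa.effects.trim(y, top_db=top_db)
    return start / sr, (len(y) - end) / sr

# Hypothetical usage: collect these values separately for bona-fide and spoofed
# file lists and compare the two distributions.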

Cited by 33 publications (16 citation statements)
References 8 publications

“…al. [13] shows that a bias can be found in the distribution of the lengths of leading and trailing silences in bona-fide and synthetic speech. The authors argue that most detectors are probably just discriminating between forged and bona-fide samples by using this information.…”
Section: Dataset Preparation and Experimental Setup
confidence: 94%
“…The authors argue that most detectors are probably just discriminating between forged and bona-fide samples by using this information. To bypass this problem, silent parts were removed from the signal, as suggested in [13], but this led to a large loss in performance.…”
Section: Dataset Preparation and Experimental Setup
confidence: 99%
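
The mitigation described in the statement above can be sketched as follows; this is an assumption about the general approach, not the exact recipe from [13], and the paths and 30 dB threshold are placeholders. It simply strips leading and trailing silence before the audio is used for training or scoring a countermeasure.

import librosa
import soundfile as sf

def strip_silence(in_path, out_path, top_db=30.0):
    y, sr = librosa.load(in_path, sr=None)
    y_trimmed, _ = librosa.effects.trim(y, top_db=top_db)  # drop leading/trailing silence
    sf.write(out_path, y_trimmed, sr)

As the citing authors note, removing this cue cost their detector considerable performance, which is consistent with silence length acting as a shortcut feature.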
“…attacks disjoint from the attacks seen in training. However, the test audios share some specific characteristics [26] with the training data, which is why model generalization cannot be judged using the 'eval' split of ASVspoof 2019 alone. This motivates the use of our proposed 'in-the-wild' dataset, c.f.…”
Section: Train and Evaluation Data Splits
confidence: 99%