Synthetic speech detection through short-term and long-term prediction traces

Borrelli, Clara; Bestagini, Paolo; Antonacci, Fabio; Sarti, Augusto; Tubaro, Stefano

doi:10.1186/s13635-021-00116-3

Cited by 46 publications

(17 citation statements)

References 31 publications

“…However, we also found that Discriminator is prone to being deceived by the generated samples from other speech generative models as Discriminator was not jointly trained with those generative models. Therefore, we expect more robust synthesized speech detection algorithms to be developed in the future such as [48,40,9,6].…”

Section: Conclusion and Discussion (mentioning)

confidence: 99%

Neural Analysis and Synthesis: Reconstructing Speech from Self-Supervised Representations

Choi, Lee, Kim et al. 2021

Preprint

We present a neural analysis and synthesis (NANSY) framework that can manipulate voice, pitch, and speed of an arbitrary speech signal. Most of the previous works have focused on using information bottleneck to disentangle analysis features for controllable synthesis, which usually results in poor reconstruction quality. We address this issue by proposing a novel training strategy based on information perturbation. The idea is to perturb information in the original input signal (e.g., formant, pitch, and frequency response), thereby letting synthesis networks selectively take essential attributes to reconstruct the input signal. Because NANSY does not need any bottleneck structures, it enjoys both high reconstruction quality and controllability. Furthermore, NANSY does not require any labels associated with speech data such as text and speaker information, but rather uses a new set of analysis features, i.e., wav2vec feature and newly proposed pitch feature, Yingram, which allows for fully self-supervised training. Taking advantage of fully self-supervised training, NANSY can be easily extended to a multilingual setting by simply training it with a multilingual dataset. The experiments show that NANSY can achieve significant improvement in performance in several applications such as zero-shot voice conversion, pitch shift, and time-scale modification.

“…It is observed that QSVM beats other traditional approaches by 97.56% accuracy and has only a 2.43% misclassification rate. Similarly, Borrelli et al [109] created an SVM model using RF to classify artificial voices using a novel audio component known as short-term long-term (STLT). The Automatic Speaker Verification (ASV) spoof 2019 challenge dataset was used to train the models.…”

Section: Deepfake Audio Detection Techniques (mentioning)

confidence: 99%

A Review of Image Processing Techniques for Deepfakes

Shahzad, Rustam, Flores et al. 2022

Sensors

Deep learning is used to address a wide range of challenging issues including large data analysis, image processing, object detection, and autonomous control. In the same way, deep learning techniques are also used to develop software and techniques that pose a danger to privacy, democracy, and national security. Fake content in the form of images and videos using digital manipulation with artificial intelligence (AI) approaches has become widespread during the past few years. Deepfakes, in the form of audio, images, and videos, have become a major concern during the past few years. Complemented by artificial intelligence, deepfakes swap the face of one person with the other and generate hyper-realistic videos. Accompanying the speed of social media, deepfakes can immediately reach millions of people and can be very dangerous to make fake news, hoaxes, and fraud. Besides the well-known movie stars, politicians have been victims of deepfakes in the past, especially US presidents Barack Obama and Donald Trump; however, the public at large can be the target of deepfakes. To overcome the challenge of deepfake identification and mitigate its impact, large efforts have been carried out to devise novel methods to detect face manipulation. This study also discusses how to counter the threats from deepfake technology and alleviate its impact. The outcomes recommend that despite a serious threat to society, business, and political institutions, they can be combated through appropriate policies, regulation, individual actions, training, and education. In addition, the evolution of technology is desired for deepfake identification, content authentication, and deepfake prevention. Different studies have performed deepfake detection using machine learning and deep learning techniques such as support vector machine, random forest, multilayer perceptron, k-nearest neighbors, convolutional neural networks with and without long short-term memory, and other similar models. This study aims to highlight the recent research in deepfake images and video detection, such as deepfake creation, various detection algorithms on self-made datasets, and existing benchmark datasets.

“…In the audio field, [9] feeds linear filter banks into a Resnet to generate embeddings used as input of a neural network classifier, and in [10] long-term features are used to discriminate fake and real audio tracks. Recently [11] detected for audio deepfakes based on long-term and short-term predictor features, while [12] exploits the traces left by time scaling to discriminate fake audio signals.…”

Section: Introduction (mentioning)

confidence: 99%

Deepfake Speech Detection Through Emotion Recognition: A Semantic Approach

Conti, Salvi, Borrelli et al. 2022

ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Self Cite

In recent years, audio and video deepfake technology has advanced relentlessly, severely impacting people's reputation and reliability. Several factors have facilitated the growing deepfake threat. On the one hand, the hyper-connected society of social and mass media enables the spread of multimedia content worldwide in real-time, facilitating the dissemination of counterfeit material. On the other hand, neural network-based techniques have made deepfakes easier to produce and difficult to detect, showing that the analysis of low-level features is no longer sufficient for the task. This situation makes it crucial to design systems that allow detecting deepfakes at both video and audio levels. In this paper, we propose a new audio spoofing detection system leveraging emotional features. The rationale behind the proposed method is that audio deepfake techniques cannot correctly synthesize natural emotional behavior. Therefore, we feed our deepfake detector with high-level features obtained from a state-of-the-art Speech Emotion Recognition (SER) system. As the used descriptors capture semantic audio information, the proposed system proves robust in cross-dataset scenarios outperforming the considered baseline on multiple datasets.
