Warning: Humans cannot reliably detect speech deepfakes

Kimberly, T.; Bray, Sergi D; Davies, T.; Griffin, Lewis D.

doi:10.1371/journal.pone.0285333

Cited by 12 publications

(6 citation statements)

References 36 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…This permits for an extremely high accuracy in voice clones in a similar domain to the training data but new advancements and subtle changes in these obscure features could soon make these prediction models obsolete. Indeed, when a high-accuracy prediction model was tested on new, out-of-domain voice clones in a recent study, the prediction accuracy was abysmal (AUC is approximately 25%) [ 10 ]. We aimed to evaluate the use of perceptual features in current and future model implementations by testing model performance on a completely new generator.…”

Section: Discussionmentioning

confidence: 99%

“…For example, the previously mentioned tool that achieved 100% accuracy was trained and tested on a data set of deepfakes generated in 2019, which are of much lower quality than the level of deepfakes available in 2023 [ 8 ]. Furthermore, recent work has shown that out-of-domain voice clone detectors (ie, voice detectors applied outside of the data set in which they were applied) had extremely low performance, obtaining an area under the receiver operator curve (AUC) of 25% [ 10 ]. A more robust detection method might involve searching for the absence of biological features in the cloned voice, rather than the presence of digital features [ 11 ].…”

Section: Introductionmentioning

confidence: 99%

“…This may result in subtle but detectable differences in the way pauses are present in authentic versus cloned audio. Indeed, when humans were asked to distinguish between audio deepfakes and authentic voices, one of the primary justifications for a fake audio classification was unnatural pauses in the recordings [ 10 ]. Furthermore, when these features were integrated into a classification regime, a moderate accuracy (approximately 85%) was achieved when analyzing deepfakes by perceptual features such as the amplitude of speech and pauses within a recording [ 12 ].…”

Section: Introductionmentioning

confidence: 99%

See 2 more Smart Citations

Investigation of Deepfake Voice Detection Using Speech Pause Patterns: Algorithm Development and Validation

Kulangareth,

Kaufman,

Oreskovic

et al. 2024

JMIR Biomed Eng

View full text Add to dashboard Cite

Background The digital era has witnessed an escalating dependence on digital platforms for news and information, coupled with the advent of “deepfake” technology. Deepfakes, leveraging deep learning models on extensive data sets of voice recordings and images, pose substantial threats to media authenticity, potentially leading to unethical misuse such as impersonation and the dissemination of false information. Objective To counteract this challenge, this study aims to introduce the concept of innate biological processes to discern between authentic human voices and cloned voices. We propose that the presence or absence of certain perceptual features, such as pauses in speech, can effectively distinguish between cloned and authentic audio. Methods A total of 49 adult participants representing diverse ethnic backgrounds and accents were recruited. Each participant contributed voice samples for the training of up to 3 distinct voice cloning text-to-speech models and 3 control paragraphs. Subsequently, the cloning models generated synthetic versions of the control paragraphs, resulting in a data set consisting of up to 9 cloned audio samples and 3 control samples per participant. We analyzed the speech pauses caused by biological actions such as respiration, swallowing, and cognitive processes. Five audio features corresponding to speech pause profiles were calculated. Differences between authentic and cloned audio for these features were assessed, and 5 classical machine learning algorithms were implemented using these features to create a prediction model. The generalization capability of the optimal model was evaluated through testing on unseen data, incorporating a model-naive generator, a model-naive paragraph, and model-naive participants. Results Cloned audio exhibited significantly increased time between pauses (P<.001), decreased variation in speech segment length (P=.003), increased overall proportion of time speaking (P=.04), and decreased rates of micro- and macropauses in speech (both P=.01). Five machine learning models were implemented using these features, with the AdaBoost model demonstrating the highest performance, achieving a 5-fold cross-validation balanced accuracy of 0.81 (SD 0.05). Other models included support vector machine (balanced accuracy 0.79, SD 0.03), random forest (balanced accuracy 0.78, SD 0.04), logistic regression, and decision tree (balanced accuracies 0.76, SD 0.10 and 0.72, SD 0.06). When evaluating the optimal AdaBoost model, it achieved an overall test accuracy of 0.79 when predicting unseen data. Conclusions The incorporation of perceptual, biological features into machine learning models demonstrates promising results in distinguishing between authentic human voices and cloned audio.

show abstract

Section: Discussionmentioning

confidence: 99%

Section: Introductionmentioning

confidence: 99%

Section: Introductionmentioning

confidence: 99%

See 1 more Smart Citation

Investigation of Deepfake Voice Detection Using Speech Pause Patterns: Algorithm Development and Validation

Kulangareth,

Kaufman,

Oreskovic

et al. 2024

JMIR Biomed Eng

View full text Add to dashboard Cite

show abstract

“…An MIT Exploration of Generative AI • From Novel Chemicals to Opera Labeling AI-Generated Content: Promises, Perils, and Future Directions 3 (Farid 2022;Köbis, Doležalová, and Soraperra 2021;Mai et al 2023), and these issues will undoubtedly worsen as technology continues to improve and evolve (Thompson and Hsu 2023;Vynck 2023).…”

Section: Listen To This Articlementioning

confidence: 99%

Labeling AI-Generated Content: Promises, Perils, and Future Directions

Wittenberg,

Epstein,

Berinsky

et al. 2024

An MIT Exploration of Generative AI

View full text Add to dashboard Cite

Labeling is a commonly proposed strategy for reducing the risks of generative artificial intelligence (AI). This approach involves applying visible content warnings to alert users to the presence of AI-generated media online (e.g., on social media, news sites, or search engines). Although there is little direct evidence regarding the effectiveness of labeling AI-generated media, a large volume of academic work suggests that warning labels can substantially reduce people's belief in and sharing of content debunked by professional factcheckers. Thus, there is reason to believe that labeling could help inform members of the public about AIgenerated media. In this paper, we provide a framework for helping policymakers, platforms, and practitioners weigh various factors related to the labeling of AI-generated content online. First, we argue that, before developing labeling programs and policies related to generative AI, stakeholders must establish the objective(s) that labeling is intended to accomplish. Here, we distinguish two such goals: (1) communicating to viewers the process by which a given piece of content was created or edited (i.e., with or without using generative AI tools) versus ( 2) diminishing the likelihood that content misleads or deceives its viewers (a result that does not necessarily depend on whether the content was created using AI). Next, we summarize results from two largescale experiments demonstrating that labeling can, under certain conditions, meaningfully decrease individuals' likelihood of believing and engaging with misleading, AI-generated images. Finally, we highlight several important issues and challenges that must be considered when designing, evaluating, and implementing labeling policies and programs, including the need to (1) determine what types of content to label and how to reliably identify this content at scale, (2) consider the inferences viewers will draw about both labeled and unlabeled content, and (3) evaluate the efficacy of labeling approaches across contexts.

show abstract

“…Video, of course, relies on both audio and visual channels, and the role of audio itself should not be underestimated. For example, humans cannot reliably detect speech deepfakes [18]. The above-mentioned body of research demonstrates the need to examine dis-/ misinformation from a multimodal perspective.…”

Section: Introductionmentioning

confidence: 99%

Multimodal analysis of disinformation and misinformation

Wilson,

Wilkes,

Teramoto

et al. 2023

R. Soc. open sci.

View full text Add to dashboard Cite

The use of disinformation and misinformation campaigns in the media has attracted much attention from academics and policy-makers. Multimodal analysis or the analysis of two or more semiotic systems—language, gestures, images, sounds, among others—in their interrelation and interaction is essential to understanding dis-/misinformation efforts because most human communication goes beyond just words. There is a confluence of many disciplines (e.g. computer science, linguistics, political science, communication studies) that are developing methods and analytical models of multimodal communication. This literature review brings research strands from these disciplines together, providing a map of the multi- and interdisciplinary landscape for multimodal analysis of dis-/misinformation. It records the substantial growth starting from the second quarter of 2020—the start of the COVID-19 epidemic in Western Europe—in the number of studies on multimodal dis-/misinformation coming from the field of computer science. The review examines that category of studies in more detail. Finally, the review identifies gaps in multimodal research on dis-/misinformation and suggests ways to bridge these gaps including future cross-disciplinary research directions. Our review provides scholars from different disciplines working on dis-/misinformation with a much needed bird's-eye view of the rapidly emerging research of multimodal dis-/misinformation.

show abstract

Warning: Humans cannot reliably detect speech deepfakes

Cited by 12 publications

References 36 publications

Investigation of Deepfake Voice Detection Using Speech Pause Patterns: Algorithm Development and Validation

Investigation of Deepfake Voice Detection Using Speech Pause Patterns: Algorithm Development and Validation

Labeling AI-Generated Content: Promises, Perils, and Future Directions

Multimodal analysis of disinformation and misinformation

Contact Info

Product

Resources

About