2019
DOI: 10.1145/3306346.3323028
Text-based editing of talking-head video

Abstract (excerpt): …a large variety of edits, such as the addition, removal, and alteration of words, as well as convincing language translation and full sentence synthesis.

Cited by 238 publications (153 citation statements) | References 67 publications
“…Not only the visual part: Suwajanakorn et al. [9] presented a method for learning the mapping between speech and lip movements in which speech can also be synthesized, enabling the creation of a fully functional spoof video. Fried et al. [34] demonstrated that speech can be easily modified in any video in accordance with the manipulator's intent while maintaining a seamless audio-visual flow. Averbuch-Elor et al. [8] addressed a different problem: converting still portraits into motion pictures expressing various emotions.…”
Section: Face Manipulation
Confidence: 99%
“…Despite those technical solutions to detect synthetic media and approaches to educate humans on detecting machine-manipulated media (Groh et al. 2019), a further, quite strict idea is to limit the availability of trained generative models. Against this background, it is astounding how unquestioningly papers have been published in recent years in which leap innovations in the generation of fake media, especially videos, are described, although many research groups, for instance the one behind Face2Face, did not release their code (Fried et al. 2019; Ovadya and Whittlestone 2019; Thies et al. 2015, 2016, 2018, 2019). Synthetic videos, no matter if they are generated through Face2Face, DeepFakes, FaceSwap or NeuralTextures, can have all sorts of negative consequences, from harm to individuals and national security to the economy and democracy (Chesney and Citron 2018).…”
Section: Synthetic Media
Confidence: 99%
“…Prior transcript-based audio editing tools use time-aligned text transcripts of spoken audio to automatically group similar sentences, highlight repeated words, and maintain synchronization between multiple speakers [33], support automatic alignment of music with spoken audio [32,31], or enable linked editing between script writing and audio recording and editing [36]. Transcript-based video production systems analyze time-aligned video transcripts to identify points for inserting [14] or removing footage [1], allow for vocally-annotating raw footage [27,41], or enable the synthesis of short segments of talking-head video of puppets [3] and people [4]. Other systems use script transcript analysis to select relevant video clips [18], or leverage linguistic structures to create corresponding graphical structures [45].…”
Section: Transcript-based Audio and Video Editing
Confidence: 99%
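The core mechanism behind the transcript-based editors cited above is simple: each word is aligned to a time span in the audio, so deleting words from the transcript deletes the corresponding audio spans. The sketch below is a minimal illustration of that idea, not code from any of the cited systems; `AlignedWord` and `remove_words` are invented names, and the word-level timestamps are assumed to come from a forced aligner.

```python
from dataclasses import dataclass
from typing import List, Set

@dataclass
class AlignedWord:
    word: str
    start: float  # start time in seconds (hypothetical aligner output)
    end: float    # end time in seconds

def remove_words(samples: List[float], sr: int,
                 alignment: List[AlignedWord],
                 drop: Set[int]) -> List[float]:
    """Splice out the sample spans of the dropped words, keeping the rest."""
    out: List[float] = []
    cursor = 0  # next sample index not yet copied
    for i, w in enumerate(alignment):
        if i in drop:
            s, e = int(w.start * sr), int(w.end * sr)
            out.extend(samples[cursor:s])  # copy audio before the dropped word
            cursor = e                     # skip the word's own samples
    out.extend(samples[cursor:])           # copy the trailing audio
    return out
```

For example, at 10 samples per second with three one-second words, dropping the middle word splices the first and last seconds together. Real systems additionally crossfade at each cut point to avoid audible clicks, and the text-based video editing work extends the same idea to the visual track by synthesizing matching mouth motion.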