Exploring Temporal Coherence for More General Video Face Forgery Detection

Zheng, Yinglin; Bao, Jianmin; Chen, Dong; Zhang, Ming; Wen, Fang

doi:10.1109/iccv48922.2021.01477

Cited by 131 publications

(48 citation statements)

References 39 publications

“…Table 1 shows results obtained by RealForensics on each manipulation type in the FF++ dataset after training on the remaining types. Our detector works on par with the state-of-the-art without (1) using auxiliary labelled supervision [52], (2) heavily constraining the network by freezing large parts [52] or removing spatial convolutions [112], nor (3) using audio at test-time [115]. We also outperform the baseline of training a CSN [98] network on the forgery data (with the same augmentations as RealForensics), indicating the effectiveness of leveraging real data using our approach.…”

Section: Cross-manipulation Generalisation (mentioning)

confidence: 88%

“…Unlike our method, it requires a large-scale labelled dataset and focuses exclusively on the mouth region. Very recently, [112] report high generalisation by reducing the spatial kernel sizes of convolutional layers to 1, thus learning temporal inconsistencies while ignoring spatial ones. By contrast, we target spatiotemporal irregularities that may be more consistent with human perception of forgery cues.…”

Section: Face Forgery Detection (mentioning)

confidence: 99%

“…A deployed detector is expected to recognise fake videos that were created using methods not seen during training, a non-trivial task in practice [52,65,82,112]. In this section, we follow the protocol used in [52,67,82] to evaluate our detector's ability to generalise to unseen manipulations.…”

Section: Cross-manipulation Generalisation (mentioning)

confidence: 99%

“…It is reasonable to believe that incorporating the temporal dimension (along with the spatial ones) can improve performance, especially since many synthesis methods do not take into account temporal consistency during the generation process [91]. However, as with frame-based methods, naively training deep networks on videos can lead to strong overfitting to the seen forgeries [52,94,112]. To counteract this, LipForensics [52] trains a network on a …”

Section: Introduction (mentioning)

confidence: 99%

“…On the other hand, (1) it requires pretraining on a labelled dataset, limiting its scalability; (2) it focuses exclusively on the mouth region; and (3) it freezes almost one third of the network when training on forgery data, which could sacrifice performance. A very recent method, FTCN [112], demonstrates high cross-manipulation generalisation by constraining all spatial convolutional kernel sizes to one. But, as we show, the impressive generalisation may come at a cost of reduced robustness to compression changes.…”

Section: Introduction (mentioning)

confidence: 99%

Leveraging Real Talking Faces via Self-Supervision for Robust Forgery Detection

Haliassos, Mira, Petridis et al. 2022

Preprint

One of the most pressing challenges for the detection of face-manipulated videos is generalising to forgery methods not seen during training while remaining effective under common corruptions such as compression. In this paper, we question whether we can tackle this issue by harnessing videos of real talking faces, which contain rich information on natural facial appearance and behaviour and are readily available in large quantities online. Our method, termed RealForensics, consists of two stages. First, we exploit the natural correspondence between the visual and auditory modalities in real videos to learn, in a self-supervised crossmodal manner, temporally dense video representations that capture factors such as facial movements, expression, and identity. Second, we use these learned representations as targets to be predicted by our forgery detector along with the usual binary forgery classification task; this encourages it to base its real/fake decision on said factors. We show that our method achieves state-of-the-art performance on cross-manipulation generalisation and robustness experiments, and examine the factors that contribute to its performance. Our results suggest that leveraging natural and unlabelled videos is a promising direction for the development of more robust face forgery detectors.

Adaptive Face Forgery Detection in Cross Domain

Song, Fang, Li et al. 2022

Lecture Notes in Computer Science

UIA-ViT: Unsupervised Inconsistency-Aware Method Based on Vision Transformer for Face Forgery Detection

Zhuang, Chu, Tan et al. 2022

Lecture Notes in Computer Science

Intra-frame inconsistency has proved effective for the generalization of face forgery detection. However, learning to focus on these inconsistencies requires extra pixel-level forged location annotations. Acquiring such annotations is non-trivial. Some existing methods generate large-scale synthesized data with location annotations, which are composed only of real images and cannot capture the properties of forgery regions. Others generate forgery location labels by subtracting paired real and fake images, yet such paired data is difficult to collect and the generated labels are usually discontinuous. To overcome these limitations, we propose a novel Unsupervised Inconsistency-Aware method based on Vision Transformer, called UIA-ViT, which only makes use of video-level labels and can learn inconsistency-aware features without pixel-level annotations. Owing to the self-attention mechanism, the attention map among patch embeddings naturally represents the consistency relation, making the vision Transformer suitable for consistency representation learning. Based on the vision Transformer, we propose two key components: Unsupervised Patch Consistency Learning (UPCL) and Progressive Consistency Weighted Assemble (PCWA). UPCL is designed for learning the consistency-related representation with progressively optimized pseudo annotations. PCWA enhances the final classification embedding with previous patch embeddings optimized by UPCL to further improve the detection performance. Extensive experiments demonstrate the effectiveness of the proposed method.
