Abstract: Although current face manipulation techniques achieve impressive quality and controllability, they struggle to generate temporally coherent face videos. In this work, we explore how to take full advantage of temporal coherence for video face forgery detection. To achieve this, we propose a novel end-to-end framework consisting of two major stages. The first stage is a fully temporal convolution network (FTCN). The key insight of FTCN is to reduce the spatial convolution kernel size to 1 while keeping the temporal convolution kernel size unchanged. Surprisingly, we find that this design helps the model extract temporal features and improves its generalization capability. The second stage is a Temporal Transformer network, which aims to explore long-term temporal coherence. The proposed framework is general and flexible, and can be trained directly from scratch without any pre-trained models or external datasets. Extensive experiments show that our framework outperforms existing methods and remains effective when applied to detect new kinds of face forgery videos.


### What problem does this paper attempt to solve?

This paper addresses the **temporal coherence problem** in video face forgery detection. Although current face-manipulation techniques perform well in terms of image quality and controllability, they struggle to generate temporally coherent facial videos. Forged facial videos are usually generated independently, frame by frame, which inevitably leads to flickering and discontinuity in the face regions of the video (see Figure 1). Exploiting these temporal incoherencies therefore allows forged facial videos to be detected more generally and robustly.

#### Main problems and solutions

1. **Temporal coherence problem**:
   - Current forged videos mainly contain two types of artifacts: spatial artifacts (such as blending boundaries, checkerboard patterns, and blur artifacts) and temporal incoherencies.
   - Spatial artifacts are usually more prominent than temporal incoherencies, so existing spatio-temporal convolutional networks tend to rely on spatial artifacts for classification rather than on temporal incoherencies.
2. **Limitations of existing methods**:
   - Most existing detection methods are trained on known face-manipulation techniques, and their performance drops significantly when they encounter unknown manipulation methods.
   - Some methods are very sensitive to common perturbations (such as image or video compression and noise), which limits their generalization ability.
3. **Proposed new framework**:
   - To make full use of temporal coherence, the authors propose a new end-to-end framework consisting of two main stages:
     - **First stage: Fully Temporal Convolutional Network (FTCN)**: The spatial convolution kernel size is reduced to 1 while the temporal convolution kernel size is kept unchanged. This forces the network to learn temporal features and improves its generalization ability.
     - **Second stage: Temporal Transformer Network**: Captures long-term temporal coherence.
4. **Innovative points**:
   - The framework does not require any pre-trained models or external datasets and can be trained directly from scratch.
   - Extensive experiments show that the framework outperforms existing methods in various challenging scenarios and remains effective for detecting new types of facial forgery videos.

### Summary

The main contribution of this paper is to explore how temporal coherence can be fully exploited to detect video face forgeries. It proposes a framework that combines the Fully Temporal Convolutional Network (FTCN) with a Temporal Transformer, thereby achieving more general and robust video face forgery detection.
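The core FTCN idea described above (spatial kernel size 1, temporal kernel size unchanged) can be illustrated with a small pure-Python sketch. This is not the authors' implementation: the function names `temporal_conv` and `flicker_score`, the fixed second-difference kernel, and the toy clips are all assumptions chosen for illustration, whereas a real FTCN learns its kernels. The sketch only shows that a kernel of shape (k_t × 1 × 1) mixes each pixel with itself across neighbouring frames, so it responds to frame-to-frame flicker and cannot see the spatial layout.

```python
from typing import List

Video = List[List[List[float]]]  # single-channel clip, indexed as video[t][y][x]


def temporal_conv(video: Video, kernel: List[float]) -> Video:
    """Apply a (k_t x 1 x 1) convolution with 'valid' padding.

    Each output value mixes only the SAME pixel location across k_t
    consecutive frames; no neighbouring pixels are involved.
    """
    k = len(kernel)
    frames, height, width = len(video), len(video[0]), len(video[0][0])
    out = []
    for t in range(frames - k + 1):
        out.append([
            [sum(kernel[i] * video[t + i][y][x] for i in range(k))
             for x in range(width)]
            for y in range(height)
        ])
    return out


def flicker_score(video: Video) -> float:
    """Mean absolute response to a hand-picked second-difference kernel.

    Assumption: a fixed kernel stands in for the learned temporal filters.
    """
    response = temporal_conv(video, [1.0, -2.0, 1.0])
    values = [abs(v) for frame in response for row in frame for v in row]
    return sum(values) / len(values)


# Hypothetical 5-frame, 2x2 clips.
# "smooth": every pixel changes linearly over time (temporally coherent).
smooth = [[[0.1 * t, 0.2 * t], [0.3 * t, 0.4 * t]] for t in range(5)]

# "flicker": the same clip, but frame 2 is altered independently of its
# neighbours, mimicking frame-by-frame generation.
flicker = [[row[:] for row in frame] for frame in smooth]
flicker[2] = [[v + 0.5 for v in row] for row in flicker[2]]

# Mirror the pixel columns identically in every frame: the spatial layout
# changes, but each pixel's temporal trajectory does not.
shuffled = [[row[::-1] for row in frame] for frame in flicker]

smooth_score = flicker_score(smooth)
flicker_score_value = flicker_score(flicker)
shuffled_score = flicker_score(shuffled)

print(smooth_score, flicker_score_value, shuffled_score)
```

In this toy setting the coherent clip scores essentially zero, the flickering clip scores clearly higher, and rearranging the pixels leaves the score unchanged. The last point reflects the design choice: with a spatial kernel size of 1, a single layer has no access to spatial patterns such as blending boundaries, so it must rely on temporal cues.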

Exploring Temporal Coherence for More General Video Face Forgery Detection

A Temporal Consistency Learning Framework for Face Forgery Detection

Analyzing temporal coherence for deepfake video detection

Unified Video and Image Representation for Boosted Video Face Forgery Detection

Learning Natural Consistency Representation for Face Forgery Video Detection

Latent Spatiotemporal Adaptation for Generalized Face Forgery Video Detection

UniForensics: Face Forgery Detection via General Facial Representation

FakeTransformer: Exposing Face Forgery From Spatial-Temporal Representation Modeled By Facial Pixel Variations

Spatial-temporal Transformer Network for Protecting Person-of-interest from Deepfaking

Dynamic Difference Learning with Spatio-temporal Correlation for Deepfake Video Detection

Learning Multi-Granularity Temporal Characteristics for Face Anti-Spoofing

AltFreezing for More General Video Face Forgery Detection

F2Trans: High-Frequency Fine-Grained Transformer for Face Forgery Detection

Generalizing Deepfake Video Detection with Plug-and-Play: Video-Level Blending and Spatiotemporal Adapter Tuning

Constructing Spatio-Temporal Graphs for Face Forgery Detection

Research on video face forgery detection model based on multiple feature fusion network

Face Forgery Detection with Long-Range Noise Features and Multilevel Frequency-Aware Clues

Exposing video surveillance object forgery by combining TSF features and attention-based deep neural networks

Face forgery detection by progressively enhancing spatial and frequency-aware features

Face Forensics in the Wild