Exploring Temporal Coherence for More General Video Face Forgery Detection

Yinglin Zheng, Jianmin Bao, Dong Chen, Ming Zeng, Fang Wen
DOI: https://doi.org/10.48550/arXiv.2108.06693
2021-08-15
Abstract: Although current face manipulation techniques achieve impressive quality and controllability, they struggle to generate temporally coherent face videos. In this work, we explore how to take full advantage of temporal coherence for video face forgery detection. To achieve this, we propose a novel end-to-end framework that consists of two major stages. The first stage is a fully temporal convolution network (FTCN). The key insight of FTCN is to reduce the spatial convolution kernel size to 1 while keeping the temporal convolution kernel size unchanged. We surprisingly find that this special design helps the model extract temporal features and also improves its generalization capability. The second stage is a Temporal Transformer network, which aims to explore long-term temporal coherence. The proposed framework is general and flexible, and can be trained directly from scratch without any pre-trained models or external datasets. Extensive experiments show that our framework outperforms existing methods and remains effective when applied to detect new sorts of face forgery videos.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses the problem of exploiting **temporal coherence** in video face forgery detection. Although current face-manipulation techniques perform well in terms of image quality and controllability, they struggle to generate temporally coherent face videos. Forged face videos are usually generated independently, frame by frame, which inevitably leads to flickering and discontinuity in the face region (see Figure 1). These temporal incoherences can therefore serve as a more general and robust cue for detecting forged face videos.

#### Main problems and solutions

1. **Temporal coherence problem**:
   - Current forged videos mainly contain two types of artifacts: spatial artifacts (such as blending boundaries, checkerboard patterns, and blurring) and temporal incoherences.
   - Spatial artifacts are usually more prominent than temporal incoherences, so existing spatio-temporal convolutional networks tend to rely on spatial artifacts for classification rather than on temporal incoherences.
2. **Limitations of existing methods**:
   - Most existing detection methods are trained on known face-manipulation techniques, and their performance drops significantly on unseen manipulation methods.
   - Some methods are very sensitive to common perturbations (such as image or video compression and noise), which limits their generalization ability.
3. **Proposed new framework**:
   - To make full use of temporal coherence, the authors propose a new end-to-end framework consisting of two main stages:
     - **First stage: Fully Temporal Convolution Network (FTCN)**: The spatial convolution kernel size is reduced to 1 while the temporal convolution kernel size is kept unchanged. This pushes the network to learn temporal features and improves its generalization ability.
     - **Second stage: Temporal Transformer network**: Used to capture long-term temporal coherence.
4. **Innovative points**:
   - The framework does not require any pre-trained models or external datasets and can be trained directly from scratch.
   - Extensive experiments show that the framework outperforms existing methods in various challenging scenarios and remains effective on new types of face forgery videos.

### Summary

The main contribution of this paper is to explore how to make full use of temporal coherence to detect video face forgeries. It proposes a framework that combines the Fully Temporal Convolution Network (FTCN) with a Temporal Transformer, achieving more general and robust video face forgery detection.
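
The FTCN design described above (spatial kernel size 1, temporal kernel size unchanged) can be illustrated with a small sketch. This is not the authors' network. It is a single-channel toy in plain Python, and the function name `temporal_conv` and the hand-picked difference kernel `[-1, 1]` are assumptions made purely for illustration. With a 1×1 spatial kernel, each output value depends only on the time series at the same pixel, so the filter cannot respond to spatial patterns within a frame.

```python
def temporal_conv(video, kernel):
    """Apply a k_t x 1 x 1 filter (no padding) to a video[t][y][x] nested list.

    Like convolution layers in deep learning frameworks, this is a
    cross-correlation: the kernel is not flipped. Because the spatial
    extent of the kernel is 1 x 1, each output pixel mixes values from
    the same (y, x) location across k_t consecutive frames only.
    """
    k = len(kernel)
    num_frames = len(video)
    height = len(video[0])
    width = len(video[0][0])
    out = []
    for t in range(num_frames - k + 1):
        frame = [
            [sum(kernel[i] * video[t + i][y][x] for i in range(k))
             for x in range(width)]
            for y in range(height)
        ]
        out.append(frame)
    return out


# Hypothetical 4-frame, 2x2 "video" in which every frame is identical.
static = [[[1.0, 2.0], [3.0, 4.0]] for _ in range(4)]

# Copy it and make one pixel flicker in frame 2.
flicker = [[row[:] for row in frame] for frame in static]
flicker[2][0][1] = 5.0

diff_kernel = [-1.0, 1.0]  # hand-chosen temporal difference, not a learned filter

static_response = temporal_conv(static, diff_kernel)
flicker_response = temporal_conv(flicker, diff_kernel)
```

In this toy, the temporally consistent clip produces an all-zero response regardless of its spatial content, while the flickering pixel produces non-zero responses only at its own location. A real FTCN would use many learned channels inside a deep 3D network, but the restriction to temporal-only mixing is the same.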