ConTrans-Detect: A Multi-Scale Convolution-Transformer Network for DeepFake Video Detection

Weirong Sun, Yujun Ma, Hong Zhang, Ruili Wang
DOI: https://doi.org/10.1109/m2vip58386.2023.10413387
2023-01-01
Abstract: Recent advances in generative deep learning have made DeepFakes, synthetic media produced by manipulations such as swapping a person's face in one video with a face from another video, easy to generate and hard to detect. Existing methods use Convolutional Neural Networks (CNNs) to identify manipulated regions for DeepFake video detection. However, these methods may not fully address the difficulties of learning low-level spatial features and capturing temporal variations, both of which are crucial for face forgery detection. Therefore, we propose a Convolution-Transformer Deepfake Detection (ConTrans-Detect) model, comprising a multi-scale CNN module for spatial feature representation and a multi-branch Transformer module for temporal feature modeling. The multi-scale CNN module uses a 3D Inception block to extract multi-scale low-level features (e.g., edges, corners, and angles) from videos. The multi-branch Transformer module consists of multi-stream Transformer layers, each taking a different temporal resolution and spatial feature dimension as input to perceive various motion variations. Our model achieves an AUC of 0.929 and an F1 score of 0.920 on the DeepFake Detection Challenge (DFDC) dataset, surpassing several state-of-the-art methods.
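The abstract reports results as AUC and F1 score. As a reminder of what these two metrics measure for a real-versus-fake classifier, here is a minimal pure-Python sketch. The function names, the toy labels and scores, and the 0.5 decision threshold are all illustrative assumptions; none of them come from the paper, and this is not the authors' evaluation code.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the fraction of (fake, real) pairs ranked correctly.

    A pair is ranked correctly when the fake video (label 1) receives a
    higher score than the real video (label 0). Ties count as half.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one sample of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def f1_score(labels: Sequence[int], scores: Sequence[float],
             threshold: float = 0.5) -> float:
    """F1 (harmonic mean of precision and recall) at an assumed threshold."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy data (hypothetical): 1 = fake, 0 = real; scores are "fakeness" outputs.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]

auc = roc_auc(labels, scores)   # 8 of 9 (fake, real) pairs ordered correctly
f1 = f1_score(labels, scores)   # precision = recall = 2/3 at threshold 0.5
print(f"AUC = {auc:.3f}, F1 = {f1:.3f}")
```

AUC is threshold-free and depends only on how the scores rank fakes against reals, whereas F1 depends on the chosen decision threshold, which is why the two are often reported together.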