CNN and Transformer-coordinated Deepfake Detection
Li Ying, Bian Shan, Wang Chuntao, Lu Wei
DOI: https://doi.org/10.11834/jig.220519
2023-01-01
Journal of Image and Graphics
Abstract:
Objective: Research on deepfake detection has recently become a hot topic as a countermeasure against deepfake videos. Its purpose is to identify fake videos that are synthesized by deep forgery technology and spread on social networks such as WeChat, Instagram and TikTok. Existing methods extract forgery features with a convolutional neural network (CNN) and determine the final classification score by feeding these features to a classifier. When facing low-quality or highly compressed deepfake videos, these methods try to improve detection performance by extracting deeper spatial-domain information. However, the forgery traces left in the spatial domain diminish under compression, and local features tend to become similar, which severely degrades performance. This motivates retaining the frequency-domain information of forgery artifacts, which suffers less interference from JPEG compression, as an additional forensic clue. CNN-based spatial feature extraction captures facial artifacts by stacking convolutions, but its receptive field is limited, so it models local information well while ignoring the relationships between pixels at the global level. The Transformer excels at long-range dependency modelling in natural language processing and computer vision tasks, so it is often used to model the relationships between image pixels and to compensate for the CNN's deficiency in acquiring global information. However, the Transformer can only process sequence information, so it still needs the cooperation of a CNN in computer vision tasks.
Method: First, we develop a novel joint detection model that leverages the advantages of both the CNN and the Transformer and enriches the feature representation with frequency-domain information. EfficientNet-b0 serves as the feature extractor.
To mine more forensic features, in the spatial feature extraction stage an attention module is embedded in the shallow layers, and the deep features are multiplied by the activation map produced by this attention module. In the frequency-domain feature extraction stage, we use the discrete cosine transform (DCT) as the frequency transform and add an adaptive part to the frequency band decomposition so that frequency-domain features are learned better. During training, we adopt mixed-precision training to make training faster and more memory-efficient. Then, to construct the joint model, we connect the feature extraction branches to a modified Transformer structure. The Transformer models inter-region feature correlations through global self-attention in an encoder structure. To further enable information interaction between the dual-domain features, cross-attention is computed between the branches using a cross-attention structure. Furthermore, we design and implement a random data augmentation strategy, which works together with the attention mechanism to improve the detection accuracy of the model in cross-compression-rate and cross-dataset scenarios.
Result: Our joint model is compared with nine state-of-the-art deepfake detection methods on two datasets, FaceForensics++ (FF++) and Celeb-DF. In the cross-compression-rate detection experiments on the FF++ dataset, our detection accuracy reaches 90.35%, 71.79% and 80.71% for Deepfakes, Face2Face and NeuralTextures (NT) manipulated images, respectively. In the cross-dataset experiments, i.e., training on FaceForensics++ and testing on Celeb-DF, our training time is reduced.
Conclusion: The experiments demonstrate that the proposed joint model improves detection accuracy in both cross-dataset and cross-compression-rate settings.
Our joint model takes advantage of both EfficientNet and the Transformer, and combines features from different domains with attention and data augmentation mechanisms, making the model more accurate and efficient.
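The DCT-based frequency band decomposition with an adaptive part, as described in the Method, can be illustrated with a small sketch. This is a minimal pure-Python example under stated assumptions, not the authors' implementation: the function names (`dct2`, `idct2`, `band_decompose`), the three-band split by the index sum `u + v`, the band edges, and the `tanh`-squashed additive offsets that stand in for the learned adaptive part are all hypothetical choices made for illustration.

```python
import math


def dct_matrix(n):
    """Return the n x n orthonormal DCT-II basis matrix."""
    mat = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        mat.append([scale * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                    for i in range(n)])
    return mat


def matmul(a, b):
    """Plain matrix product of two lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(row) for row in zip(*a)]


def dct2(block):
    """2D DCT of a square block: C = D X D^T."""
    d = dct_matrix(len(block))
    return matmul(matmul(d, block), transpose(d))


def idct2(coeffs):
    """Inverse 2D DCT of a square block: X = D^T C D."""
    d = dct_matrix(len(coeffs))
    return matmul(matmul(transpose(d), coeffs), d)


def band_decompose(block, offsets=None, edges=(2, 5)):
    """Split a square block into low/mid/high frequency components.

    Coefficient (u, v) belongs to the band whose range contains u + v.
    `offsets` (hypothetical) holds one n x n grid per band; tanh(offset)
    is added to the fixed 0/1 mask to mimic an adaptive, learnable part.
    With offsets=None the masks are fixed and the bands sum to the input.
    """
    n = len(block)
    coeffs = dct2(block)
    bounds = [0, edges[0], edges[1], 2 * n - 1]  # u + v lies in [0, 2n - 2]
    bands = []
    for b in range(3):
        lo, hi = bounds[b], bounds[b + 1]
        masked = [[0.0] * n for _ in range(n)]
        for u in range(n):
            for v in range(n):
                base = 1.0 if lo <= u + v < hi else 0.0
                adapt = math.tanh(offsets[b][u][v]) if offsets else 0.0
                masked[u][v] = coeffs[u][v] * (base + adapt)
        bands.append(idct2(masked))  # back to the spatial domain
    return bands


# Toy 8 x 8 "image" block with arbitrary values, for demonstration only.
demo_block = [[float((3 * i + 5 * j) % 7) for j in range(8)] for i in range(8)]
demo_bands = band_decompose(demo_block)
```

In a trainable model, the offsets would be parameters updated by backpropagation, and each band image would be passed to the frequency branch of the feature extractor; here they are left at zero so that the three bands add back up to the original block.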