MesoNet: a Compact Facial Video Forgery Detection Network

Darius Afchar, Vincent Nozick, Junichi Yamagishi, Isao Echizen
DOI: https://doi.org/10.1109/WIFS.2018.8630761
2018-09-04
Abstract: This paper presents a method to automatically and efficiently detect face tampering in videos, and particularly focuses on two recent techniques used to generate hyper-realistic forged videos: Deepfake and Face2Face. Traditional image forensics techniques are usually not well suited to videos due to the compression that strongly degrades the data. Thus, this paper follows a deep learning approach and presents two networks, both with a low number of layers to focus on the mesoscopic properties of images. We evaluate those fast networks on both an existing dataset and a dataset we have constituted from online videos. The tests demonstrate a very successful detection rate with more than 98% for Deepfake and 95% for Face2Face.
Computer Vision and Pattern Recognition, Image and Video Processing
What problem does this paper attempt to address?
The paper aims to detect face forgery in videos automatically and efficiently, focusing on two techniques that generate hyper-realistic forged videos: Deepfake and Face2Face. Traditional image forensics techniques are usually ill-suited to videos because video compression severely degrades the data. The paper therefore adopts a deep learning approach and proposes two networks with a small number of layers that focus on the mesoscopic properties of images, to cope with the challenges introduced by video compression. Both networks are tested on an existing dataset and on a dataset constructed by the authors, with reported detection rates of over 98% for Deepfake and 95% for Face2Face.

### Background of Deepfake Detection

With the spread of smartphones and the growth of social networks, digital images and videos have become very common digital objects. Nearly 2 billion pictures are reportedly uploaded to the Internet every day. This volume has been accompanied by a rise in image tampering techniques, for example with editing software such as Photoshop. Digital image forensics is the research field dedicated to detecting image forgeries in order to regulate the spread of false content. Although many methods already exist for detecting image forgeries, video forgery detection remains difficult, mainly because video compression strongly degrades the frames.

### Deepfake and Face2Face Technologies

- **Deepfake**: Face swapping is achieved by training two auto-encoders. One auto-encoder reconstructs face images of target person A, and the other reconstructs face images of source person B. The two auto-encoders share the weights of the encoder, but their decoders remain independent.
This method can generate highly realistic forged videos, but it also has flaws, such as failures when the face is occluded and blurred details.
- **Face2Face**: Facial reenactment is achieved by tracking facial expressions in the source video and the target video in real time, and then synthesizing the expressions of the source video onto the face in the target video. This method does not require deep learning and instead uses traditional computer vision techniques.

### Proposed Method

The paper proposes a deep neural network method based on mesoscopic analysis to detect forged videos generated by Deepfake and Face2Face. Two network architectures are proposed:

- **Meso-4**: Four successive convolution and pooling layers, followed by a fully connected network with one hidden layer. ReLU activations, batch normalization, and Dropout are used to improve generalization and robustness.
- **MesoInception-4**: Based on Meso-4, with the first two convolutional layers replaced by a variant of the Inception module. The variant uses 3×3 dilated convolutions to avoid introducing high-level semantic information, and adds 1×1 convolutions for dimension reduction and to act as skip connections.

### Experimental Results

- **Deepfake Dataset**: When frames are classified independently, Meso-4 and MesoInception-4 reach accuracies of 89.1% and 91.7% respectively. With image aggregation, the detection rates rise to 96.9% and 98.4%.
- **Face2Face Dataset**: Across three compression levels, Meso-4 reaches classification accuracies of 94.6%, 92.4%, and 83.2%, and MesoInception-4 reaches 96.8%, 93.4%, and 81.3%. With image aggregation, the detection rate rises to 95.3%.

### Conclusion

The network architectures proposed in this paper achieve a high detection rate for forged videos generated by Deepfake and Face2Face under realistic conditions.
By visualizing the layers and filters of the network, the study found that the eye and mouth regions play a key role in Deepfake detection, while in forged images the face area tends to lack detail compared with the untouched background. Future research will aim to improve the understanding of deep networks in order to build more effective and efficient detection methods.
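
The image aggregation step reported in the experimental results can be illustrated with a minimal sketch. It assumes that aggregation means averaging the per-frame network outputs over a video and thresholding the mean. The function names, the 0.5 threshold, the convention that a higher score means "forged", and the sample scores are all hypothetical and chosen only for illustration.

```python
from statistics import mean


def aggregate_video_score(frame_scores):
    """Average per-frame forgery scores into a single video-level score.

    Assumption: each score is a network output in [0, 1], where higher
    means "more likely forged".
    """
    if not frame_scores:
        raise ValueError("need at least one frame score")
    return mean(frame_scores)


def classify_video(frame_scores, threshold=0.5):
    """Label a video 'forged' when its averaged score reaches the threshold."""
    score = aggregate_video_score(frame_scores)
    return "forged" if score >= threshold else "real"


# Hypothetical scores for five frames of one video. Two frames fall below
# the threshold on their own, but the averaged score is 0.62.
scores = [0.9, 0.4, 0.8, 0.7, 0.3]
print(classify_video(scores))  # prints "forged"
```

Averaging lets confident frames outweigh a few misclassified ones, which is one plausible reason the video-level rates are higher than the per-frame accuracies.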