Vision Transformer Based Video Hashing Retrieval for Tracing the Source of Fake Videos

Pengfei Pei, Xianfeng Zhao, Yun Cao, Jinchuan Li, Xuyuan Lai
DOI: https://doi.org/10.48550/arXiv.2112.08117
2022-09-06
Abstract: In recent years, the spread of fake videos has brought great influence on individuals and even countries. It is important to provide robust and reliable results for fake videos. The results of conventional detection methods are not reliable and not robust for unseen videos. Another alternative and more effective way is to find the original video of the fake video. For example, fake videos from the Russia-Ukraine war and the Hong Kong law revision storm are refuted by finding the original video. We use an improved retrieval method to find the original video, named ViTHash. Specifically, tracing the source of fake videos requires finding the unique one, which is difficult when there are only small differences in the original videos. To solve the above problems, we designed a novel loss Hash Triplet Loss. In addition, we designed a tool called Localizator to compare the difference between the original traced video and the fake video. We have done extensive experiments on FaceForensics++, Celeb-DF and DeepFakeDetection, and we also have done additional experiments on our built three datasets: DAVIS2016-TL (video inpainting), VSTL (video splicing) and DFTL (similar videos). Experiments have shown that our performance is better than state-of-the-art methods, especially in cross-dataset mode. Experiments also demonstrated that ViTHash is effective in various forgery detection: video inpainting, video splicing and deepfakes. Our code and datasets have been released on GitHub: https://github.com/lajlksdf/vtl.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses the problem of tracing forged videos back to their source. Forged videos now spread widely and can affect individuals and even countries, so a robust and reliable way to identify where a forged video came from has become important. Conventional detection methods perform poorly on unseen videos, and the paper argues that finding the original video behind a forgery is a more effective alternative. For example, fake videos from the Russia-Ukraine war and the Hong Kong unrest over the proposed extradition law were refuted by locating the original footage.

To meet these challenges, the authors propose ViTHash, a video hashing retrieval method based on Vision Transformer (ViT). The method generates a hash code for each video and uses it to retrieve the unique original of a forged video, even when the candidate originals differ only slightly. The authors also design a new loss function, Hash Triplet Loss, and a tool named Localizator, which compares the traced original video with the forged video to show where they differ.

The main contributions of the paper include:

1. **Novel architecture design**: A new architecture detects forged videos by tracing their sources, so the result is backed by the retrieved original video as evidence rather than only a probability score.
2. **Hash Triplet Loss**: A new loss function helps distinguish the true original video from similar videos that differ only slightly.
3. **New datasets**: Because relevant object-forgery video datasets were lacking, the authors build three datasets to verify the method in different forgery scenarios: DAVIS2016-TL (video inpainting), VSTL (video splicing), and DFTL (similar videos).

With these contributions, ViTHash performs well across a variety of forgery detection tasks, especially in the cross-dataset setting. Experimental results show that ViTHash outperforms existing methods in video inpainting, video splicing, and deepfake detection.
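The retrieval idea summarized above can be illustrated with a small sketch. It is not the ViTHash implementation: the feature vectors stand in for the output of a video encoder, the sign-based `binarize` step is an assumed binarization, and `triplet_margin_loss` is the generic triplet margin loss, shown only to convey the kind of objective a triplet-style hashing loss builds on. The paper's actual Hash Triplet Loss is not specified in this summary. All function names and video ids are hypothetical.

```python
from typing import Dict, Sequence, Tuple


def binarize(features: Sequence[float]) -> Tuple[int, ...]:
    """Turn a real-valued feature vector into a binary hash code by sign (assumed scheme)."""
    return tuple(1 if x > 0 else 0 for x in features)


def hamming(a: Sequence[int], b: Sequence[int]) -> int:
    """Count the bit positions where two equal-length hash codes differ."""
    if len(a) != len(b):
        raise ValueError("hash codes must have the same length")
    return sum(x != y for x, y in zip(a, b))


def trace_source(query: Sequence[int], index: Dict[str, Sequence[int]]) -> Tuple[str, int]:
    """Return the id and Hamming distance of the indexed original closest to the query."""
    best_id = min(index, key=lambda vid: hamming(query, index[vid]))
    return best_id, hamming(query, index[best_id])


def triplet_margin_loss(
    anchor: Sequence[float],
    positive: Sequence[float],
    negative: Sequence[float],
    margin: float = 1.0,
) -> float:
    """Generic triplet margin loss on squared Euclidean distances.

    This is the textbook form, NOT the paper's Hash Triplet Loss.
    """
    def sq_dist(u: Sequence[float], v: Sequence[float]) -> float:
        return sum((x - y) ** 2 for x, y in zip(u, v))

    return max(0.0, sq_dist(anchor, positive) - sq_dist(anchor, negative) + margin)


# Hypothetical features: the fake video, its true original, and a similar original.
fake = [0.7, -0.3, 0.2, -0.6]
original_a = [0.9, -0.4, 0.7, -0.8]
original_b = [0.8, -0.5, -0.6, -0.9]

index = {"original_a": binarize(original_a), "original_b": binarize(original_b)}
print(trace_source(binarize(fake), index))  # ('original_a', 0)
```

In this toy setup the two originals differ in a single hash bit, which is the situation the summary describes: near-identical originals must still map to distinguishable codes, and a triplet-style objective pushes the fake video's representation toward its true original and away from the look-alike.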