MSVT: Multiple Spatiotemporal Views Transformer for DeepFake Video Detection

Yang Yu, Rongrong Ni, Yao Zhao, Siyuan Yang, Fen Xia, Ning Jiang, Guoqing Zhao
DOI: https://doi.org/10.1109/tcsvt.2023.3281448
IF: 5.859
2023-01-01
IEEE Transactions on Circuits and Systems for Video Technology
Abstract: DeepFake videos have developed rapidly in recent years, raising new security concerns. Because existing video-based detection methods rely on a coarse spatiotemporal view, they struggle to capture fine-grained spatiotemporal information, which limits their generalization ability. In addition, although transformers have achieved great success in the past few years, their application to DeepFake video detection remains underexplored. To address these problems, we propose a novel Multiple Spatiotemporal Views Transformer (MSVT) with a Local Spatiotemporal View (LSV) and a Global Spatiotemporal View (GSV) to mine more detailed spatiotemporal information. First, to establish the LSV, unlike existing works that sparsely sample single frames to build the input sequence, we employ a local-consecutive temporal view, built from groups of consecutive frames, to capture vital dynamic inconsistency. The extracted frame features within each group are then fed to a temporal transformer followed by a feature fusion module to generate group-level spatiotemporal features. Next, we establish the GSV by feeding all frame features of the whole video to the temporal transformer followed by the feature fusion module. Finally, we propose a novel global-local transformer (GLT) to effectively integrate these multi-level features and mine more subtle and comprehensive features. Extensive experiments on six large datasets demonstrate that MSVT outperforms state-of-the-art detection methods.
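The contrast between sparse single-frame sampling and the local-consecutive temporal view can be illustrated with a minimal sketch of frame-index selection. This is not the authors' implementation: the function names, the evenly spaced group placement, and the parameters `num_groups` and `group_len` are assumptions made for illustration, since the abstract gives no sampling details. The transformer and fusion stages are omitted.

```python
def sparse_single_frames(num_frames, num_samples):
    """Baseline style: one isolated frame index per evenly spaced segment."""
    step = num_frames / num_samples
    return [int(i * step) for i in range(num_samples)]


def local_consecutive_groups(num_frames, num_groups, group_len):
    """Hypothetical local view: evenly spaced groups, each holding
    `group_len` consecutive frame indices."""
    if num_groups * group_len > num_frames:
        raise ValueError("video too short for the requested groups")
    step = num_frames / num_groups
    groups = []
    for g in range(num_groups):
        start = int(g * step)
        groups.append(list(range(start, start + group_len)))
    return groups


def global_view(groups):
    """Hypothetical global view: every sampled frame index of the video,
    flattened in temporal order."""
    return [idx for group in groups for idx in group]


if __name__ == "__main__":
    groups = local_consecutive_groups(num_frames=32, num_groups=4, group_len=4)
    print(sparse_single_frames(32, 4))
    print(groups)
    print(global_view(groups))
```

In this sketch each group would feed the group-level (LSV) branch, while the flattened list would feed the video-level (GSV) branch. Adjacent frames within a group preserve short-range motion that isolated frames cannot expose.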