Abstract: With the continuous improvement of deepfake methods, forged content has transitioned from single-modality to multi-modal fusion, posing new challenges for existing forgery detection algorithms. In this paper, we propose AVT2-DWF, the Audio-Visual dual Transformers grounded in Dynamic Weight Fusion, which aims to amplify both intra- and cross-modal forgery cues, thereby enhancing detection capabilities. AVT2-DWF adopts a dual-stage approach to capture both spatial characteristics and temporal dynamics of facial expressions. This is achieved through a face transformer encoder with an n-frame-wise tokenization strategy and an audio transformer encoder. Subsequently, it uses multi-modal conversion with dynamic weight fusion to address the challenge of heterogeneous information fusion between audio and visual modalities. Experiments on the DeepfakeTIMIT, FakeAVCeleb, and DFDC datasets indicate that AVT2-DWF achieves state-of-the-art performance in both intra- and cross-dataset deepfake detection. Code is available at

What problem does this paper attempt to address?

This paper addresses the shift in deepfake technology from single-modality forgeries to multi-modal fusion, which poses new challenges for existing forgery detection algorithms. In response, the authors propose AVT2-DWF (Audio-Visual dual Transformers grounded in Dynamic Weight Fusion), a method that combines audio and visual transformer encoders with a dynamic weight fusion strategy and aims to improve detection by amplifying both intra-modal and cross-modal forgery cues.

Specifically, the paper notes that existing single-modality detection methods perform poorly across datasets. Multi-modal audio-visual forgery detection has made progress, but it mostly concentrates on fusing features from different modalities and neglects to optimize how features are extracted within each modality. AVT2-DWF therefore improves intra-modal feature extraction with an n-frame-wise tokenization strategy and uses a dynamic weight fusion (DWF) module to balance the fusion of cross-modal forgery cues.

The main contributions of the paper are:

1. An n-frame-wise tokenization strategy that strengthens the extraction of facial features within video frames, including the nuances of facial expressions, movements, and interactions.
2. A multi-modal conversion and dynamic weight fusion (DWF) mechanism that improves the fusion of heterogeneous information from the audio and visual modalities.
3. The combined method, AVT2-DWF, which integrates the two components above and is evaluated on widely used public benchmarks, where it is reported to be broadly applicable and effective.
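To make the two ideas in this summary more concrete, the following is a minimal NumPy sketch, not the authors' implementation. The summary does not say how tokens are built or how the fusion weights are computed, so everything here is an assumption made for illustration: the function names `tokenize_n_frames` and `dynamic_weight_fusion`, the choice to flatten each group of n consecutive frames into one token, and the softmax gate (parameters `w_gate`, `b_gate`) that produces one weight per modality.

```python
import numpy as np


def tokenize_n_frames(frames: np.ndarray, n: int) -> np.ndarray:
    """Group consecutive frames into tokens of n frames each (assumed scheme).

    frames: array of shape (T, H, W, C).
    Trailing frames that do not fill a complete group are dropped.
    Returns an array of shape (T // n, n * H * W * C).
    """
    if n <= 0:
        raise ValueError("n must be positive")
    usable = (frames.shape[0] // n) * n
    if usable == 0:
        raise ValueError("need at least n frames")
    # Each row holds n consecutive frames, flattened in order.
    return frames[:usable].reshape(usable // n, -1)


def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def dynamic_weight_fusion(audio_feat, visual_feat, w_gate, b_gate):
    """Fuse two modality features with input-dependent weights (assumed gate).

    audio_feat, visual_feat: arrays of shape (B, D).
    w_gate: hypothetical gate matrix of shape (2 * D, 2).
    b_gate: hypothetical gate bias of shape (2,).
    Returns (fused, weights) with shapes (B, D) and (B, 2).
    """
    joint = np.concatenate([audio_feat, visual_feat], axis=-1)  # (B, 2D)
    # One weight per modality per sample; each row sums to 1.
    weights = softmax(joint @ w_gate + b_gate, axis=-1)
    fused = weights[:, :1] * audio_feat + weights[:, 1:] * visual_feat
    return fused, weights
```

In this sketch the weights depend on the input features, so each sample can lean more on audio or on video; with an all-zero gate the fusion reduces to a plain average of the two modalities. In a trained model the gate parameters would be learned jointly with the encoders.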

AVT2-DWF: Improving Deepfake Detection with Audio-Visual Fusion and Dynamic Weighting Strategies