AVFF: Audio-Visual Feature Fusion for Video Deepfake Detection

Trevine Oorloff, Surya Koppisetti, Nicolò Bonettini, Divyaraj Solanki, Ben Colman, Yaser Yacoob, Ali Shahriyari, Gaurav Bharaj

2024-06-05

Abstract: With the rapid growth in deepfake video content, we require improved and generalizable methods to detect them. Most existing detection methods either use uni-modal cues or rely on supervised training to capture the dissonance between the audio and visual modalities. While the former disregards the audio-visual correspondences entirely, the latter predominantly focuses on discerning audio-visual cues within the training corpus, thereby potentially overlooking correspondences that can help detect unseen deepfakes. We present Audio-Visual Feature Fusion (AVFF), a two-stage cross-modal learning method that explicitly captures the correspondence between the audio and visual modalities for improved deepfake detection. The first stage pursues representation learning via self-supervision on real videos to capture the intrinsic audio-visual correspondences. To extract rich cross-modal representations, we use contrastive learning and autoencoding objectives, and introduce a novel audio-visual complementary masking and feature fusion strategy. The learned representations are tuned in the second stage, where deepfake classification is pursued via supervised learning on both real and fake videos. Extensive experiments and analysis suggest that our novel representation learning paradigm is highly discriminative in nature. We report 98.6% accuracy and 99.1% AUC on the FakeAVCeleb dataset, outperforming the current audio-visual state-of-the-art by 14.9% and 9.9%, respectively.

Computer Vision and Pattern Recognition, Multimedia, Sound, Audio and Speech Processing

What problem does this paper attempt to address?

### The Problem the Paper Attempts to Solve

With the rapid growth of deepfake video content, improved and generalizable methods are needed to detect fake videos. Most existing detection methods either rely on cues from a single modality (visual or audio) or use supervised training to capture inconsistencies between the audio and visual modalities. The former ignores audio-visual correspondence entirely, while the latter mainly learns the audio-visual cues present in the training corpus and may therefore overlook correspondences that would help detect unseen deepfakes.

The paper proposes Audio-Visual Feature Fusion (AVFF), a two-stage cross-modal learning method that improves deepfake detection by explicitly capturing the correspondence between the audio and visual modalities:

1. **Representation Learning Stage**: Learn rich cross-modal representations through self-supervised learning on real videos, using contrastive learning and autoencoding objectives together with a novel audio-visual complementary masking and feature fusion strategy.
2. **Classification Stage**: Tune the learned representations through supervised learning on both real and fake videos to classify deepfakes.

With this design, the authors aim to exploit the intrinsic correspondence between the audio and visual modalities when detecting deepfakes, improving detection accuracy and robustness. Experimental results show that the method outperforms existing audio-visual state-of-the-art methods on multiple benchmark datasets; on the FakeAVCeleb dataset it achieves 98.6% accuracy and 99.1% AUC, improvements of 14.9% and 9.9%, respectively, over the previous audio-visual state of the art.
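The two self-supervised ingredients named for the representation learning stage, complementary masking and a contrastive objective, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the summary gives no architectural details, so the function names (`complementary_masks`, `info_nce`), the equal number of temporal slices per modality, the 0.5 mask ratio, the temperature of 0.1, and the choice of a symmetric InfoNCE loss are all assumptions made for illustration.

```python
import math
import random


def complementary_masks(num_slices, mask_ratio=0.5, seed=0):
    """Return (visual_mask, audio_mask) as boolean lists, True = masked.

    Assumption: both modalities are cut into the same number of temporal
    slices, and every slice hidden in one modality stays visible in the
    other, so the two masks are exact complements.
    """
    rng = random.Random(seed)
    num_masked = int(round(num_slices * mask_ratio))
    masked = set(rng.sample(range(num_slices), num_masked))
    visual_mask = [i in masked for i in range(num_slices)]
    audio_mask = [not m for m in visual_mask]
    return visual_mask, audio_mask


def cosine(u, v):
    """Cosine similarity of two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def info_nce(audio_embs, visual_embs, temperature=0.1):
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    Row i of audio_embs and row i of visual_embs come from the same real
    video (a positive pair); every other pairing in the batch is a negative.
    """
    n = len(audio_embs)
    sims = [[cosine(a, v) / temperature for v in visual_embs]
            for a in audio_embs]

    def cross_entropy(logits, target):
        # Numerically stable log-sum-exp minus the positive logit.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        return log_z - logits[target]

    audio_to_visual = sum(cross_entropy(sims[i], i) for i in range(n)) / n
    visual_to_audio = sum(
        cross_entropy([sims[j][i] for j in range(n)], i) for i in range(n)
    ) / n
    return 0.5 * (audio_to_visual + visual_to_audio)


if __name__ == "__main__":
    v_mask, a_mask = complementary_masks(8)
    print("visual masked:", v_mask)
    print("audio masked: ", a_mask)
    audio = [[1.0, 0.0], [0.0, 1.0]]
    print("loss, aligned pairs:", info_nce(audio, [[1.0, 0.0], [0.0, 1.0]]))
    print("loss, swapped pairs:", info_nce(audio, [[0.0, 1.0], [1.0, 0.0]]))
```

Under these assumptions, a model that must reconstruct the masked slices of one modality can only do so by drawing on the visible slices of the other, which is one plausible way masking could encourage cross-modal correspondence. The contrastive loss is low when each audio embedding is closest to the visual embedding from the same video and high when the pairing is broken.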

Efficient Audiovisual Fusion for Active Speaker Detection

AVoiD-DF: Audio-Visual Joint Learning for Detecting Deepfake

MIS-AVoiDD: Modality Invariant and Specific Representation for Audio-Visual Deepfake Detection

A Multi-Stream Fusion Approach with One-Class Learning for Audio-Visual Deepfake Detection

Detecting Audio-Visual Deepfakes with Fine-Grained Inconsistencies

Joint Audio-Visual Attention with Contrastive Learning for More General Deepfake Detection

AV-Lip-Sync+: Leveraging AV-HuBERT to Exploit Multimodal Inconsistency for Video Deepfake Detection

AVT2-DWF: Improving Deepfake Detection with Audio-Visual Fusion and Dynamic Weighting Strategies

AVForensics: Audio-driven Deepfake Video Detection with Masking Strategy in Self-supervision

Audio-visual Deepfake Detection Using Articulatory Representation Learning

Statistics-aware Audio-visual Deepfake Detector

Audio-Visual Temporal Forgery Detection Using Embedding-Level Fusion and Multi-Dimensional Contrastive Loss

Temporal Feature Prediction in Audio–Visual Deepfake Detection

A Multimodal Framework for Deepfake Detection

A Robust Approach to Multimodal Deepfake Detection

A Unified Framework for Modality-Agnostic Deepfakes Detection

Contextual Cross-Modal Attention for Audio-Visual Deepfake Detection and Localization

AVTENet: Audio-Visual Transformer-based Ensemble Network Exploiting Multiple Experts for Video Deepfake Detection

AVSecure: an Audio-Visual Watermarking Framework for Proactive Deepfake Detection

Evaluation of an Audio-Video Multimodal Deepfake Dataset using Unimodal and Multimodal Detectors