Abstract: Advances in computer vision and deep learning have made Deepfake videos difficult to distinguish from real ones. In particular, forged audio is now generated to accompany fake videos and make them more realistic, which makes Deepfake detection even harder. Existing Deepfake detection methods that use multimodal information ignore the representation gap between modalities, which limits their performance. To address this problem, this paper proposes a novel Deepfake detection method based on multimodal contrastive learning (MCL) that better explores intra-modal and cross-modal forgery clues. To reduce the cross-modal gap and explore multimodal forgery artifacts, a cross-modal contrastive learning strategy is designed to learn a compositional embedding from multimodal information, pulling together unimodal and multimodal representations. Moreover, to supplement the video network's ability to mine intra-frame forgery clues, frame knowledge is distilled into the video network without additional computation. Specifically, to mine intra-modal clues, three modality features are first extracted from audio, frames, and video, respectively. Second, the audio and frame features are each composed with the video feature to derive two cross-modal representations. These cross-modal features are then contrasted with the intra-modal features to reduce the cross-modal gap. By jointly pulling together the unimodal and multimodal features through MCL, a more effective representation containing both intra-modal and cross-modal forgery artifacts can be learned. Finally, a noise-based feature augmentation (NFA) module is proposed to adaptively perturb the audio-visual feature and further improve generalization. Extensive experiments demonstrate that the proposed framework outperforms state-of-the-art (SOTA) methods.
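
The cross-modal contrastive step described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not specify how features are composed, which loss is used, or how positives are chosen. The sketch assumes composition by element-wise sum with L2 normalization, an InfoNCE-style loss, and positives given by features of the same sample; the names `compose`, `info_nce`, and `cross_modal_loss` are hypothetical. It uses only the Python standard library so that it stays self-contained.

```python
import math
import random


def l2_normalize(vec):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def compose(feat_a, feat_b):
    """Assumed composition: element-wise sum followed by L2 normalization."""
    return l2_normalize([a + b for a, b in zip(feat_a, feat_b)])


def info_nce(queries, keys, temperature=0.1):
    """Mean InfoNCE loss; queries[i] and keys[i] form the positive pair."""
    queries = [l2_normalize(q) for q in queries]
    keys = [l2_normalize(k) for k in keys]
    total = 0.0
    for i, q in enumerate(queries):
        # Cosine similarity of query i to every key, scaled by the temperature.
        logits = [sum(a * b for a, b in zip(q, k)) / temperature for k in keys]
        max_logit = max(logits)  # subtract the max for numerical stability
        log_denom = max_logit + math.log(
            sum(math.exp(z - max_logit) for z in logits)
        )
        total += log_denom - logits[i]
    return total / len(queries)


def cross_modal_loss(audio, frame, video, temperature=0.1):
    """Contrast each composed feature with the unimodal features it came from."""
    audio_video = [compose(a, v) for a, v in zip(audio, video)]
    frame_video = [compose(f, v) for f, v in zip(frame, video)]
    return (
        info_nce(audio_video, audio, temperature)
        + info_nce(audio_video, video, temperature)
        + info_nce(frame_video, frame, temperature)
        + info_nce(frame_video, video, temperature)
    ) / 4.0


if __name__ == "__main__":
    # Random stand-ins for the audio, frame, and video encoder outputs.
    rng = random.Random(0)
    batch, dim = 4, 8
    audio, frame, video = (
        [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(batch)]
        for _ in range(3)
    )
    print(cross_modal_loss(audio, frame, video))
```

Minimizing such a loss pulls each composed audio-video or frame-video feature toward the unimodal features of the same sample and pushes it away from those of other samples in the batch, which is one plausible reading of "reducing the cross-modal gap". A label-aware variant, in which positives are defined by the real/fake label, would be an equally reasonable interpretation of the abstract.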

MCL: Multimodal Contrastive Learning for Deepfake Detection

Magnifying multimodal forgery clues for Deepfake detection

AVoiD-DF: Audio-Visual Joint Learning for Detecting Deepfake

A Unified Framework for Modality-Agnostic Deepfakes Detection

Audio-Visual Temporal Forgery Detection Using Embedding-Level Fusion and Multi-Dimensional Contrastive Loss

Joint Audio-Visual Attention with Contrastive Learning for More General Deepfake Detection

A Robust Approach to Multimodal Deepfake Detection

Multimodal Deepfake Detection for Short Videos

AV-Lip-Sync+: Leveraging AV-HuBERT to Exploit Multimodal Inconsistency for Video Deepfake Detection

A Deepfake Video Detection Method Based on Multi-Modal Deep Learning Method

Audio-Visual Contrastive Pre-train for Face Forgery Detection

MC-LCR: Multi-modal contrastive classification by locally correlated representations for effective face forgery detection

Learning to Detect Deepfakes via Adaptive Attention and Constrained Difference

Unsupervised Multimodal Deepfake Detection Using Intra- and Cross-Modal Inconsistencies

Towards General Visual-Linguistic Face Forgery Detection

MCL: A Contrastive Learning Method for Multimodal Data Fusion in Violence Detection

Multimodaltrace: Deepfake Detection using Audiovisual Representation Learning

AVForensics: Audio-driven Deepfake Video Detection with Masking Strategy in Self-supervision

A Multimodal Framework for Deepfake Detection

PVASS-MDD: Predictive Visual-audio Alignment Self-supervision for Multimodal Deepfake Detection