MCL: Multimodal Contrastive Learning for Deepfake Detection

Xiaolong Liu,Yang Yu,Xiaolong Li,Yao Zhao
DOI: https://doi.org/10.1109/tcsvt.2023.3312738
IF: 5.859
2023-01-01
IEEE Transactions on Circuits and Systems for Video Technology
Abstract:Advancements in computer vision and deep learning have led to difficulty in distinguishing Deepfake and real videos. In particular, forgery audios are also generated to accompany fake videos and make them more realistic, which makes Deepfake detection more difficult. Existing Deepfake detection methods that use multimodal information ignore the representation gap between different modalities, resulting in limited performance. To address this problem, in this paper, a novel Deepfake detection method utilizing multimodal contrastive learning (MCL) is proposed to better explore intra-modal and cross-modal forgery clues. To reduce the cross-modal gap and explore multimodal forgery artifacts, a cross-modal contrastive learning strategy is designed to learn a compositional embedding from multimodal information, which facilitates pulling together representations across uni-modalities and multi-modalities. Moreover, to supplement the intra-frame forgery clues mining ability of the video network, the frame knowledge is distilled to the video network without adding additional computation. Specifically, to mine intra-modal clues, three modality features are first extracted from audio, frame and video, respectively. Secondly, the audio and frame features are separately composed with the video feature to derive two cross-modal representations. Subsequently, these cross-modal features are contrastive with the intra-modal features to reduce cross-modal gap. By jointly pulling together the unimodal and multimodal features through MCL, a more effective representation that contains intra-modal and cross-modal forgery artifacts can be learned. Finally, a noise-based feature augmentation (NFA) module is proposed to adaptively perturb the audio-visual feature and further improve generalization performance. Extensive experiments demonstrate that the proposed framework outperforms SOTA methods.
What problem does this paper attempt to address?