Audio-Visual Temporal Forgery Detection Using Embedding-Level Fusion and Multi-Dimensional Contrastive Loss

Miao Liu, Jing Wang, Xinyuan Qian, Haizhou Li
DOI: https://doi.org/10.1109/tcsvt.2023.3326694
IF: 5.859
2023-01-01
IEEE Transactions on Circuits and Systems for Video Technology
Abstract: Audio-visual deepfake detection is the task of identifying deepfakes generated with AI algorithms using both audio and visual content. Most existing methods focus on the overall authenticity of a clip and neglect where in time the forgery occurs. This is problematic because even a small alteration in a clip can significantly change its meaning. Such attacks are new and dangerous, and how to tackle them remains an open question. In this paper, we present a novel neural network-based model for the temporal forgery detection (TFD) problem. It consists of new audio and visual encoders with cross-modal attention for embedding extraction, and an embedding-level fusion mechanism with self-attention for forgery localization. In addition, we propose a multi-dimensional contrastive loss that helps the model both capture audio-visual inconsistency for deepfake detection and exploit temporal inconsistency by coherently constraining the extracted embeddings. Extensive experiments on the LAV-DF dataset show that the presented method outperforms several state-of-the-art temporal forgery localization methods by up to 23.4% on AP@0.5 and 13.8% on AR@100. We also show the effectiveness of the proposed model on deepfake detection.
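For readers unfamiliar with the AP@0.5 metric quoted in the abstract: in temporal localization benchmarks, a predicted segment is usually counted as correct when its temporal intersection-over-union (IoU) with a ground-truth segment reaches the stated threshold, here 0.5. The sketch below illustrates only that matching criterion. The function names and the (start, end) tuple representation are assumptions made for illustration and are not taken from the paper, and the full AP computation (ranking predictions by confidence and integrating precision over recall) is omitted.

```python
def temporal_iou(pred, truth):
    """Intersection-over-union of two (start, end) time segments, in seconds."""
    # Overlap is the gap between the later start and the earlier end, floored at 0.
    inter = max(0.0, min(pred[1], truth[1]) - max(pred[0], truth[0]))
    union = (pred[1] - pred[0]) + (truth[1] - truth[0]) - inter
    return inter / union if union > 0 else 0.0


def is_hit(pred, truth, threshold=0.5):
    """Hypothetical helper: True if the prediction matches at the given IoU threshold."""
    return temporal_iou(pred, truth) >= threshold
```

For example, a prediction covering seconds 1 to 3 against a forged segment covering seconds 2 to 4 overlaps for one second out of a three-second union, so it would not count as a hit at the 0.5 threshold.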
engineering, electrical & electronic