AV-Lip-Sync+: Leveraging AV-HuBERT to Exploit Multimodal Inconsistency for Video Deepfake Detection

Sahibzada Adil Shahzad, Ammarah Hashmi, Yan-Tsung Peng, Yu Tsao, Hsin-Min Wang
2023-11-06
Abstract: Multimodal manipulations (also known as audio-visual deepfakes) make it difficult for unimodal deepfake detectors to detect forgeries in multimedia content. To avoid the spread of false propaganda and fake news, timely detection is crucial. The damage to either modality (i.e., visual or audio) can only be discovered through multi-modal models that can exploit both pieces of information simultaneously. Previous methods mainly adopt uni-modal video forensics and use supervised pre-training for forgery detection. This study proposes a new method based on a multi-modal self-supervised-learning (SSL) feature extractor to exploit inconsistency between audio and visual modalities for multi-modal video forgery detection. We use the transformer-based SSL pre-trained Audio-Visual HuBERT (AV-HuBERT) model as a visual and acoustic feature extractor and a multi-scale temporal convolutional neural network to capture the temporal correlation between the audio and visual modalities. Since AV-HuBERT only extracts visual features from the lip region, we also adopt another transformer-based video model to exploit facial features and capture spatial and temporal artifacts caused during the deepfake generation process. Experimental results show that our model outperforms all existing models and achieves new state-of-the-art performance on the FakeAVCeleb and DeepfakeTIMIT datasets.
Computer Vision and Pattern Recognition, Artificial Intelligence, Machine Learning, Multimedia, Sound, Audio and Speech Processing
What problem does this paper attempt to address?
This paper addresses the detection of multimodal fake videos (i.e., audio-visual deepfakes). As deepfake technology advances, unimodal (audio-only or video-only) detectors find it increasingly difficult to detect forgeries in multimedia content. The paper therefore proposes a method based on multimodal self-supervised learning (SSL) feature extraction that exploits the inconsistency between the audio and visual modalities to detect multimodal video forgeries. Specifically, the method uses the Transformer-based pre-trained Audio-Visual HuBERT (AV-HuBERT) model as a visual and acoustic feature extractor, and a multi-scale temporal convolutional neural network to capture the temporal correlation between the audio and visual modalities. In addition, since AV-HuBERT extracts visual features only from the lip region, the study adopts another Transformer-based video model to exploit facial features and capture the spatio-temporal artifacts introduced during deepfake generation.

### Main contributions of the paper:

1. **Adopting a Transformer-based audio-visual feature extraction method**: Unlike existing feature extraction methods based on convolutional neural networks (CNNs), this method uses a Transformer-based architecture.
2. **Effectively capturing audio-visual correlation and synchronization**: The pre-trained self-supervised learning model captures the correlation and synchronization between the audio and video modalities.
3. **Achieving state-of-the-art performance on two multimodal deepfake datasets**: The method performs well on the FakeAVCeleb and DeepfakeTIMIT datasets and is robust to various deepfake manipulation techniques.
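To make the idea of a multi-scale temporal convolution more concrete, here is a minimal pure-Python sketch. It is not the paper's network: it is an illustration under simplifying assumptions (a single feature channel, fixed hand-picked kernels instead of learned weights, no non-linearities). The function names `temporal_conv1d` and `multi_scale_conv` are hypothetical. Each kernel size looks at a different temporal window, and the per-scale outputs are stacked at every time step.

```python
from typing import List


def temporal_conv1d(signal: List[float], kernel: List[float]) -> List[float]:
    """Zero-padded 'same'-length 1D convolution over time.

    As in most deep learning libraries, this is a cross-correlation:
    the kernel is not flipped. Only odd kernel sizes are supported.
    """
    k = len(kernel)
    if k % 2 == 0:
        raise ValueError("kernel size must be odd")
    pad = k // 2
    padded = [0.0] * pad + list(signal) + [0.0] * pad
    return [
        sum(w * padded[t + i] for i, w in enumerate(kernel))
        for t in range(len(signal))
    ]


def multi_scale_conv(signal: List[float], kernels: List[List[float]]) -> List[List[float]]:
    """Run one branch per kernel (one temporal scale each) and stack
    the branch outputs at every time step."""
    branches = [temporal_conv1d(signal, kernel) for kernel in kernels]
    return [list(step) for step in zip(*branches)]


if __name__ == "__main__":
    # Hypothetical per-frame scores and two temporal scales (windows of 1 and 3 frames).
    frames = [1.0, 2.0, 3.0, 4.0]
    scales = [[1.0], [1.0, 1.0, 1.0]]
    for step in multi_scale_conv(frames, scales):
        print(step)
```

In a trained model the kernels would be learned and applied across many feature channels; the sketch only shows how short and long temporal windows can be combined per frame.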
### Method overview:

- **Audio-visual feature extractor**: Consists of ResNet-18, a lightweight feed-forward network (FFN), and a Transformer encoder. ResNet-18 and the FFN serve as front ends for the lip images and the audio waveforms, respectively, and the Transformer encoder produces the corresponding embedding sequences.
- **Lip image feature extractor**: The output of ResNet-18 is sent to the Transformer encoder to generate a sequence of lip image embeddings.
- **Acoustic feature extractor**: The output of the FFN is sent to the Transformer encoder to generate a sequence of acoustic feature embeddings.
- **Synchronization check module**: Computes the absolute difference between the lip embedding and the acoustic embedding of each frame to generate a sequence of synchronization feature vectors.
- **Feature fusion module**: Concatenates the sequence of synchronization feature vectors and the sequence of audio-visual feature vectors along the feature dimension to form a sequence of fused representations.
- **Temporal convolutional network and classifier**: A multi-scale temporal convolutional network (MS-TCN) captures temporal correlations, and classification is performed through a temporal pooling layer followed by a linear layer.
- **Combined with facial encoder**: To improve the model's robustness and generalization to facial manipulation techniques, a pre-trained Video Vision Transformer (ViViT) is introduced as a facial encoder to extract spatio-temporal facial features.

### Experimental results:

- **Performance on the FakeAVCeleb dataset**: On various test sets, such as Faceswap, Fsgan, RTVC, wav2lip, and their combinations, the AV-Lip-Sync+ model achieves high precision, recall, F1-score, and accuracy.
- **Performance on the DeepfakeTIMIT dataset**: Under 5-fold cross-validation, the model also performs well.
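The synchronization check and feature fusion steps from the method overview can be sketched in a few lines of pure Python. This is a rough illustration, not the authors' implementation: embeddings are plain lists of floats, the names `sync_features`, `fuse`, and `temporal_mean_pool` are hypothetical, and mean pooling is assumed as the temporal pooling operation (the summary does not say which pooling is used). The MS-TCN and the linear classifier are omitted.

```python
from typing import List

Vector = List[float]
FrameSeq = List[Vector]  # one feature vector per video frame


def sync_features(lip: FrameSeq, audio: FrameSeq) -> FrameSeq:
    """Per-frame absolute difference between lip and acoustic embeddings.

    Assumes both sequences have the same number of frames and the
    same embedding size.
    """
    if len(lip) != len(audio):
        raise ValueError("lip and audio must have the same number of frames")
    return [
        [abs(l - a) for l, a in zip(lip_vec, audio_vec)]
        for lip_vec, audio_vec in zip(lip, audio)
    ]


def fuse(sync: FrameSeq, av: FrameSeq) -> FrameSeq:
    """Concatenate sync and audio-visual vectors along the feature dimension."""
    if len(sync) != len(av):
        raise ValueError("sync and av must have the same number of frames")
    return [s + v for s, v in zip(sync, av)]


def temporal_mean_pool(seq: FrameSeq) -> Vector:
    """Average over frames to obtain one clip-level vector (assumed pooling)."""
    n = len(seq)
    return [sum(column) / n for column in zip(*seq)]


if __name__ == "__main__":
    # Hypothetical 2-frame clip with 2-dimensional embeddings.
    lip = [[1.0, 2.0], [0.5, 0.0]]
    audio = [[0.5, 2.0], [1.5, 1.0]]
    av = [[1.0], [3.0]]  # placeholder audio-visual features
    fused = fuse(sync_features(lip, audio), av)
    print(temporal_mean_pool(fused))
```

A large per-frame difference indicates that the lip movement and the speech disagree at that frame, which is the inconsistency cue the detector relies on; in the described pipeline the fused sequence would pass through the MS-TCN before pooling.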
In conclusion, this paper proposes a multimodal deepfake detection method that improves detection performance by exploiting the inconsistency between the audio and visual modalities, and reports state-of-the-art results on the FakeAVCeleb and DeepfakeTIMIT datasets.