Abstract: Deepfakes pose a critical threat to biometric authentication systems by generating highly realistic synthetic media. Existing multimodal deepfake detectors often struggle to adapt to diverse data and rely on simple fusion methods. To address these challenges, we propose Gumbel-Rao Monte Carlo Bi-modal Neural Architecture Search (GRMC-BMNAS), a novel architecture search framework that employs Gumbel-Rao Monte Carlo sampling to optimize multimodal fusion. It refines the Straight-Through Gumbel-Softmax (STGS) method by reducing variance with Rao-Blackwellization, which stabilizes network training. Using a two-level search approach, the framework optimizes the network architecture, parameters, and performance. Crucial features are efficiently identified from backbone networks, while within the cell structure, a weighted fusion operation integrates information from various sources. Varying parameters such as the temperature and the number of Monte Carlo samples yields an architecture that maximizes classification performance and generalizes better. Experimental results on the FakeAVCeleb and SWAN-DF datasets demonstrate an AUC of 95.4%, achieved with minimal model parameters.

What problem does this paper attempt to address?

### What problems does this paper attempt to solve?

This paper aims to solve several key problems in audio-visual deepfake detection:

1. **Limitations of existing multimodal deepfake detectors**:
   - Existing multimodal deepfake detectors have difficulty adapting to diverse data.
   - They rely on simple fusion methods, resulting in unstable performance and poor generalization ability.
2. **Training instability and high-variance problems**:
   - The Straight-Through Gumbel-Softmax (STGS) method introduces high variance during the training process, leading to unstable training dynamics.
3. **Excessive model parameters**:
   - Existing methods usually require a large number of parameters, increasing the computational cost and training time.
4. **Improving detection accuracy and generalization ability**:
   - A more efficient and stable automatic architecture search method is needed to optimize the network structure for audio-visual deepfake detection, thereby improving classification performance and generalization ability.

To solve these problems, the authors propose a bi-modal neural architecture search framework based on Gumbel-Rao Monte Carlo sampling (GRMC-BMNAS). This framework improves on existing methods in the following ways:

- **Reducing variance**: Rao-Blackwellization lowers the variance of the STGS estimator, which stabilizes network training.
- **Optimizing fusion strategies**: A two-level search optimizes network architecture, parameters, and performance, efficiently identifying important features and combining them through weighted fusion.
- **Reducing model parameters**: Adjusting the temperature and the number of Monte Carlo samples yields a model with fewer parameters but higher performance.

Experimental results show that GRMC-BMNAS achieves an AUC of 95.4% on the FakeAVCeleb and SWAN-DF datasets, with fewer model parameters and shorter training time. This indicates that the method not only improves detection accuracy but also enhances the generalization ability and training efficiency of the model.
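
To make the variance-reduction idea more concrete, the following is a minimal sketch of Gumbel-Rao Monte Carlo sampling in plain Python. It is not the authors' implementation: the function names (`conditional_perturbed_logits`, `gumbel_rao_sample`) and the example logits are hypothetical, and the construction follows the general Gumbel-Rao recipe as commonly described (sample a discrete choice, then average several tempered softmax relaxations drawn conditionally on that choice). Only the forward sampling is shown; in an autograd framework the one-hot choice would be used in the forward pass and gradients would flow through the averaged relaxation.

```python
import math
import random


def softmax(values, temperature=1.0):
    """Numerically stable softmax of values / temperature."""
    scaled = [v / temperature for v in values]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def conditional_perturbed_logits(logits, winner, rng):
    """Draw Gumbel-perturbed logits given that `winner` is the argmax."""
    z = sum(math.exp(l) for l in logits)
    # Exponential(1) draws; the floor guards against log(0) on a zero draw.
    e = [max(rng.expovariate(1.0), 1e-300) for _ in logits]
    top = math.log(z) - math.log(e[winner])
    return [
        top if j == winner
        else -math.log(e[j] / math.exp(l) + e[winner] / z)
        for j, l in enumerate(logits)
    ]


def gumbel_rao_sample(logits, temperature=1.0, num_samples=10, rng=None):
    """Return (one_hot, relaxed): a discrete choice and its averaged relaxation."""
    rng = rng or random.Random()
    n = len(logits)
    # Step 1: sample the discrete choice from softmax(logits).
    winner = rng.choices(range(n), weights=softmax(logits), k=1)[0]
    one_hot = [1.0 if j == winner else 0.0 for j in range(n)]
    # Step 2: average tempered softmaxes over conditional Gumbel draws.
    relaxed = [0.0] * n
    for _ in range(num_samples):
        perturbed = conditional_perturbed_logits(logits, winner, rng)
        for j, p in enumerate(softmax(perturbed, temperature)):
            relaxed[j] += p / num_samples
    return one_hot, relaxed


if __name__ == "__main__":
    # Hypothetical architecture logits for three candidate operations.
    choice, surrogate = gumbel_rao_sample(
        [2.0, 0.5, -1.0], temperature=0.5, num_samples=25, rng=random.Random(0)
    )
    print(choice, surrogate)
```

Here `temperature` and `num_samples` correspond to the two parameters the summary mentions. Setting `num_samples=1` reduces the sketch to a single conditional relaxation, which is the straight-through Gumbel-Softmax case. Larger values average out more of the Gumbel noise while the discrete choice stays fixed.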

Gumbel Rao Monte Carlo based Bi-Modal Neural Architecture Search for Audio-Visual Deepfake Detection

Straight Through Gumbel Softmax Estimator based Bimodal Neural Architecture Search for Audio-Visual Deepfake Detection

A Multimodal Framework for Deepfake Detection

A Robust Approach to Multimodal Deepfake Detection

Multimodal Deepfake Detection for Short Videos

A defensive attention mechanism to detect deepfake content across multiple modalities

Multimodaltrace: Deepfake Detection using Audiovisual Representation Learning

Multimodal Deepfake Detection

Evaluation of an Audio-Video Multimodal Deepfake Dataset using Unimodal and Multimodal Detectors

Temporal Feature Prediction in Audio–Visual Deepfake Detection

A Unified Framework for Modality-Agnostic Deepfakes Detection

Integrating Audio-Visual Features for Multimodal Deepfake Detection

Deepfake Detection System Using Deep Neural Networks

Contextual Cross-Modal Attention for Audio-Visual Deepfake Detection and Localization

MIS-AVoiDD: Modality Invariant and Specific Representation for Audio-Visual Deepfake Detection

Combating deepfakes: a comprehensive multilayer deepfake video detection framework

AVFF: Audio-Visual Feature Fusion for Video Deepfake Detection

Real-Time Deepfake Video Detection Using Eye Movement Analysis with a Hybrid Deep Learning Approach

How Good is ChatGPT at Audiovisual Deepfake Detection: A Comparative Study of ChatGPT, AI Models and Human Perception

Statistics-aware Audio-visual Deepfake Detector

Detection of Deepfake Video Using Residual Neural Network and Long Short-Term Memory