Gumbel Rao Monte Carlo based Bi-Modal Neural Architecture Search for Audio-Visual Deepfake Detection

Aravinda Reddy PN, Raghavendra Ramachandra, Krothapalli Sreenivasa Rao, Pabitra Mitra, Vinod Rathod
2024-10-09
Abstract: Deepfakes pose a critical threat to biometric authentication systems by generating highly realistic synthetic media. Existing multimodal deepfake detectors often struggle to adapt to diverse data and rely on simple fusion methods. To address these challenges, we propose Gumbel-Rao Monte Carlo Bi-modal Neural Architecture Search (GRMC-BMNAS), a novel architecture search framework that employs Gumbel-Rao Monte Carlo sampling to optimize multimodal fusion. It refines the Straight-Through Gumbel-Softmax (STGS) method by reducing variance with Rao-Blackwellization, which stabilizes network training. Using a two-level search approach, the framework optimizes the network architecture, parameters, and performance. Crucial features are efficiently identified from backbone networks, while within the cell structure, a weighted fusion operation integrates information from various sources. Varying parameters such as the temperature and the number of Monte Carlo samples yields an architecture that maximizes classification performance and generalizes better. Experimental results on the FakeAVCeleb and SWAN-DF datasets demonstrate an AUC of 95.4%, achieved with minimal model parameters.
Cryptography and Security, Sound, Audio and Speech Processing
What problem does this paper attempt to address?
### What problems does this paper attempt to solve?

This paper aims to solve several key problems in audio-visual deepfake detection:

1. **Limitations of existing multi-modal deepfake detectors**:
   - Existing multi-modal deepfake detectors have difficulty adapting to diverse data.
   - They rely on simple fusion methods, resulting in unstable performance and poor generalization ability.
2. **Training instability and high-variance problems**:
   - The Straight-Through Gumbel-Softmax (STGS) method introduces high variance during the training process, leading to unstable training dynamics.
3. **Excessive model parameters**:
   - Existing methods usually require a large number of parameters, increasing the computational cost and training time.
4. **Improving detection accuracy and generalization ability**:
   - A more efficient and stable automatic architecture search method is needed to optimize the network structure of audio-visual deepfake detection, thereby improving classification performance and generalization ability.

To solve these problems, the authors propose a bi-modal neural architecture search framework based on Gumbel-Rao Monte Carlo sampling (GRMC-BMNAS). This framework improves existing methods in the following ways:

- **Reducing variance**: Reduce variance through Rao-Blackwellization to stabilize network training.
- **Optimizing fusion strategies**: Adopt a two-stage search method to optimize network architecture, parameters, and performance, efficiently identify important features and perform weighted fusion.
- **Reducing model parameters**: By adjusting the temperature and the number of Monte Carlo samples, obtain a model with fewer parameters but higher performance.

Experimental results show that GRMC-BMNAS achieves an AUC of 95.4% on the FakeAVCeleb and SWAN-DF datasets, with fewer model parameters and shorter training time.
This indicates that this method not only improves detection accuracy but also enhances the generalization ability and training efficiency of the model.
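
To make the variance-reduction idea concrete, here is a minimal sketch of Gumbel-Rao Monte Carlo sampling in plain Python. It is not the authors' implementation. The function names (`softmax`, `conditional_perturbed_logits`, `gumbel_rao_sample`) and the parameters `temperature` and `num_samples` are illustrative assumptions. The sketch draws one discrete choice, then averages several Gumbel-Softmax relaxations that are all conditioned on that same choice. In an autodiff framework the one-hot choice would be used in the forward pass and the averaged relaxation would supply the gradient. That step is omitted here.

```python
import math
import random


def softmax(values, temperature=1.0):
    """Numerically stable softmax of values / temperature."""
    scaled = [v / temperature for v in values]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def conditional_perturbed_logits(logits, chosen, rng):
    """Sample logits + Gumbel noise, conditioned on argmax == chosen."""
    peak = max(logits)
    log_z = peak + math.log(sum(math.exp(l - peak) for l in logits))
    # Exp(1) draws, floored so that log() never receives zero.
    draws = [max(rng.expovariate(1.0), 1e-12) for _ in logits]
    top = log_z - math.log(draws[chosen])  # value at the chosen index
    out = []
    for j, (logit, e) in enumerate(zip(logits, draws)):
        if j == chosen:
            out.append(top)
        else:
            # Truncated Gumbel: always strictly below `top`.
            out.append(-math.log(e / math.exp(logit) + math.exp(-top)))
    return out


def gumbel_rao_sample(logits, temperature=1.0, num_samples=10, rng=random):
    """Return (one_hot, surrogate) for one categorical draw.

    `surrogate` is the average of `num_samples` tempered softmaxes,
    each conditioned on the same discrete choice (hypothetical helper).
    """
    n = len(logits)
    chosen = rng.choices(range(n), weights=softmax(logits))[0]
    one_hot = [1.0 if j == chosen else 0.0 for j in range(n)]
    surrogate = [0.0] * n
    for _ in range(num_samples):
        perturbed = conditional_perturbed_logits(logits, chosen, rng)
        relaxed = softmax(perturbed, temperature)
        surrogate = [s + r / num_samples for s, r in zip(surrogate, relaxed)]
    return one_hot, surrogate


if __name__ == "__main__":
    rng = random.Random(0)
    one_hot, surrogate = gumbel_rao_sample(
        [0.5, 1.5, -0.3], temperature=0.5, num_samples=20, rng=rng
    )
    print(one_hot, surrogate)
```

With `num_samples=1` the sketch reduces to an ordinary straight-through Gumbel-Softmax draw. Larger values average out more of the Gumbel noise at the cost of extra computation, which is the trade-off the summary refers to when it mentions tuning the temperature and the number of Monte Carlo samples.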