Leveraging Mixture of Experts for Improved Speech Deepfake Detection

Viola Negroni, Davide Salvi, Alessandro Ilic Mezza, Paolo Bestagini, Stefano Tubaro
2024-09-24
Abstract: Speech deepfakes pose a significant threat to personal security and content authenticity. Several detectors have been proposed in the literature, and one of the primary challenges these systems have to face is the generalization over unseen data to identify fake signals across a wide range of datasets. In this paper, we introduce a novel approach for enhancing speech deepfake detection performance using a Mixture of Experts architecture. The Mixture of Experts framework is well-suited for the speech deepfake detection task due to its ability to specialize in different input types and handle data variability efficiently. This approach offers superior generalization and adaptability to unseen data compared to traditional single models or ensemble methods. Additionally, its modular structure supports scalable updates, making it more flexible in managing the evolving complexity of deepfake techniques while maintaining high detection accuracy. We propose an efficient, lightweight gating mechanism to dynamically assign expert weights for each input, optimizing detection performance. Experimental results across multiple datasets demonstrate the effectiveness and potential of our proposed approach.
Sound, Artificial Intelligence, Audio and Speech Processing
What problem does this paper attempt to address?
The paper addresses the limited generalization ability of speech deepfake detectors. Existing detection methods often degrade significantly on unseen data, especially data produced by different generation methods or in different languages. The paper proposes a method based on the Mixture of Experts (MoE) framework, which aims to improve the generalization and adaptability of the detection system by combining specialized expert models that together cover the variability in the data.

### Main Issues

1. **Insufficient Generalization Ability**: Existing detection systems perform poorly on unseen data, especially data produced by different generation methods or in different languages.
2. **Data Variability**: Modern speech deepfake datasets span many generation methods and languages, which a single model or a traditional ensemble struggles to handle effectively.
3. **Adaptability**: As deepfake technology continues to evolve, detection systems need to be updated flexibly to keep up with new techniques.

### Solution

The paper proposes an MoE-based method that addresses these issues in three ways:

1. **Expert Model Specialization**: Each expert model is pre-trained on a specific speech deepfake dataset, giving it high detection capability in that domain.
2. **Dynamic Weight Allocation**: An efficient, lightweight gating mechanism dynamically assigns expert weights for each input, optimizing detection performance.
3. **Modular Structure**: The modular structure of the MoE framework supports scalable updates, allowing it to better manage the complexity of evolving deepfake technology while maintaining high detection accuracy.
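The dynamic weight allocation described above can be illustrated with a minimal sketch. This summary does not specify the paper's gating network, so the code assumes a generic formulation: a linear gate produces one logit per expert, a softmax turns the logits into weights, and the final score is the weighted sum of the expert scores. The names `gate_weights`, `moe_score`, `expert_a`, `expert_b`, and `GATE` are hypothetical, and the toy experts return fixed scores in place of real pre-trained detectors.

```python
import math
from typing import Callable, List, Sequence

Expert = Callable[[Sequence[float]], float]


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax: outputs are positive and sum to 1."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def gate_weights(features: Sequence[float],
                 gate_matrix: Sequence[Sequence[float]]) -> List[float]:
    """Assumed linear gate: one logit per expert (one row each), then softmax."""
    logits = [sum(w * x for w, x in zip(row, features)) for row in gate_matrix]
    return softmax(logits)


def moe_score(features: Sequence[float],
              experts: Sequence[Expert],
              gate_matrix: Sequence[Sequence[float]]) -> float:
    """Combine expert scores using input-dependent gate weights."""
    weights = gate_weights(features, gate_matrix)
    scores = [expert(features) for expert in experts]
    return sum(w * s for w, s in zip(weights, scores))


# Toy stand-ins for detectors pre-trained on different datasets (assumption):
# each returns a fixed "probability of fake" regardless of the input.
def expert_a(features: Sequence[float]) -> float:
    return 0.9


def expert_b(features: Sequence[float]) -> float:
    return 0.2


# Hypothetical gate parameters: feature 0 favors expert_a, feature 1 expert_b.
GATE = [[1.0, 0.0],
        [0.0, 1.0]]
```

With `features = [0.0, 0.0]` the gate is indifferent, so `moe_score(features, [expert_a, expert_b], GATE)` averages the two experts to about 0.55. With `features = [10.0, 0.0]` nearly all weight goes to `expert_a`, and the combined score approaches 0.9. Under this formulation, adding an expert means appending one callable and one row to the gate matrix, which is one way to read the modularity claim.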
### Experimental Results

Experimental results show that the proposed MoE method performs well on multiple datasets, particularly on unseen data, where its generalization ability and adaptability significantly outperform those of traditional single-model and ensemble methods. Specifically, the enhanced variant, MoE (enhanced), performs best across all datasets, demonstrating its advantage in handling diverse and complex data.