Abstract: Speech synthesis technology has posed a serious threat to speaker verification systems. Currently, the most effective fake audio detection methods utilize pretrained models, and integrating features from various layers of the pretrained model further enhances detection performance. However, most previously proposed fusion methods require fine-tuning the pretrained models, resulting in excessively long training times and hindering model iteration when facing new speech synthesis technology. To address this issue, this paper proposes a feature fusion method based on the Mixture of Experts, which extracts and integrates features relevant to fake audio detection from layer features, guided by a gating network based on the last layer feature, while freezing the pretrained model. Experiments conducted on the ASVspoof2019 and ASVspoof2021 datasets demonstrate that the proposed method achieves competitive performance compared to methods that require fine-tuning.

What problem does this paper attempt to address?

The main problem this paper addresses is the long training time and high computational cost that fine-tuning introduces when fake audio detection methods build on pretrained models. Most existing fusion methods rely on fine-tuning the pretrained model, which makes training slow and hinders rapid iteration when new speech synthesis technologies appear.

To solve this, the paper proposes a feature fusion method based on Mixture of Experts (MoE). The method freezes the pretrained model (such as wav2vec 2.0), extracts features relevant to fake audio detection from its different layers, and uses a gating network driven by the last-layer features to guide the dynamic fusion of those features. This avoids the time and computational resources that fine-tuning requires while still achieving competitive performance on multiple datasets.

### Main Contributions

1. **Propose MoE Fusion Module**: By freezing the pretrained model and dynamically fusing its multi-layer features, the method achieves efficient fake audio detection.
2. **Improve Generalization Ability**: Experimental results on the ASVspoof2019 and ASVspoof2021 datasets show that the proposed method performs better than, or close to, methods that fine-tune the pretrained model.
3. **Reduce Computational Cost**: Freezing the pretrained model reduces the number of trainable parameters, which in turn lowers the computational cost of training.

### Method Overview

- **Pretrained Model**: wav2vec 2.0 serves as a feature extractor that produces contextual representations of the audio.
- **MoE Fusion Module**: The module consists of a gating network and multiple expert networks. Based on the last-layer features, the gating network selects the appropriate expert networks to process and fuse the features from different layers.
- **Classifier**: The AASIST classifier performs binary classification (genuine or fake audio) on the fused features.

In this way, the paper addresses the inefficiency that fine-tuning the pretrained model causes in existing methods and provides a more efficient fake audio detection scheme.
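The gated fusion step described in the method overview can be illustrated with a small NumPy sketch. This is not the paper's implementation: the class name `MoEFusion`, the use of one linear expert per layer, the soft (softmax) gate, and all dimensions are assumptions made for illustration, and random arrays stand in for the hidden states of a frozen wav2vec 2.0 model.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


class MoEFusion:
    """Toy MoE fusion (hypothetical): one linear expert per layer,
    with a gate that is computed from the last layer's features."""

    def __init__(self, num_layers, dim, out_dim, seed=0):
        rng = np.random.default_rng(seed)
        # Assumption: each expert is a single linear projection.
        self.expert_w = rng.normal(0.0, 0.02, size=(num_layers, dim, out_dim))
        # Assumption: the gate is a linear map followed by a softmax.
        self.gate_w = rng.normal(0.0, 0.02, size=(dim, num_layers))

    def gate(self, layer_feats):
        """Return per-frame weights over experts, shape (frames, num_layers)."""
        last_layer = layer_feats[-1]  # (frames, dim)
        return softmax(last_layer @ self.gate_w, axis=-1)

    def __call__(self, layer_feats):
        """Fuse features of shape (num_layers, frames, dim) into (frames, out_dim)."""
        weights = self.gate(layer_feats)
        # Apply expert l to the features of layer l.
        expert_out = np.einsum("ltd,ldo->lto", layer_feats, self.expert_w)
        # Weighted sum of the expert outputs for every frame.
        return np.einsum("tl,lto->to", weights, expert_out)


# Stand-in for hidden states from a frozen backbone: 4 layers, 10 frames, 16 dims.
layer_feats = np.random.default_rng(1).normal(size=(4, 10, 16))
fusion = MoEFusion(num_layers=4, dim=16, out_dim=8)
fused = fusion(layer_feats)
print(fused.shape)  # (10, 8)
```

In a real system only the fusion module and the downstream classifier would receive gradient updates, while the backbone that produces `layer_feats` stays fixed. That is where the reduction in trainable parameters comes from.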

Mixture of Experts Fusion for Fake Audio Detection Using Frozen wav2vec 2.0
