Mixture of Experts Fusion for Fake Audio Detection Using Frozen wav2vec 2.0

Zhiyong Wang, Ruibo Fu, Zhengqi Wen, Jianhua Tao, Xiaopeng Wang, Yuankun Xie, Xin Qi, Shuchen Shi, Yi Lu, Yukun Liu, Chenxing Li, Xuefei Liu, Guanjun Li
2024-09-18
Abstract: Speech synthesis technology has posed a serious threat to speaker verification systems. Currently, the most effective fake audio detection methods utilize pretrained models, and integrating features from various layers of the pretrained model further enhances detection performance. However, most of the previously proposed fusion methods require fine-tuning the pretrained models, resulting in excessively long training times and hindering model iteration when facing new speech synthesis technology. To address this issue, this paper proposes a feature fusion method based on the Mixture of Experts, which extracts and integrates features relevant to fake audio detection from layer features, guided by a gating network based on the last layer feature, while freezing the pretrained model. Experiments conducted on the ASVspoof2019 and ASVspoof2021 datasets demonstrate that the proposed method achieves competitive performance compared to those requiring fine-tuning.
Sound, Audio and Speech Processing
What problem does this paper attempt to address?
The paper targets the long training times and heavy computational cost incurred when existing fake audio detection methods fine-tune a pretrained model. Most existing fusion methods rely on such fine-tuning, which slows model iteration whenever a new speech synthesis technology appears. To address this, the authors propose a feature fusion method based on Mixture of Experts (MoE). The method freezes the pretrained model (such as wav2vec 2.0), extracts features relevant to fake audio detection from its different layers, and uses a gating network driven by the last-layer features to guide the dynamic fusion of those features. This avoids the time and compute that fine-tuning requires while still achieving competitive performance on the ASVspoof2019 and ASVspoof2021 datasets.

### Main Contributions

1. **MoE fusion module**: The pretrained model is frozen and its multi-layer features are fused dynamically, which makes fake audio detection efficient to train.
2. **Improved generalization**: Experiments on the ASVspoof2019 and ASVspoof2021 datasets show that the proposed method performs close to or better than methods that fine-tune the pretrained model.
3. **Reduced computational cost**: Freezing the pretrained model reduces the number of trainable parameters, which lowers the computational cost of training.

### Method Overview

- **Pretrained model**: wav2vec 2.0 serves as a feature extractor that produces contextual representations of the audio.
- **MoE fusion module**: The module consists of a gating network and multiple expert networks. The gating network uses the last-layer features to select and weight the expert networks, which process and fuse the features from different layers.
- **Classifier**: The AASIST classifier performs binary classification (bona fide or fake audio) on the fused features.

In this way, the authors address the inefficiency that fine-tuning the pretrained model introduces in existing methods and provide a more efficient fake audio detection scheme.
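The gated fusion described in the method overview can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' implementation. It assumes one linear-plus-ReLU expert per layer, a per-frame softmax gate computed from the last layer, and random arrays standing in for frozen wav2vec 2.0 hidden states. The function name `moe_fuse` and all shapes are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def moe_fuse(layer_feats, expert_w, expert_b, gate_w, gate_b):
    """Fuse per-layer features with gate weights derived from the last layer.

    layer_feats: (L, T, D) hidden states from L layers of a frozen encoder
    expert_w:    (L, D, H) one linear expert per layer (assumption)
    expert_b:    (L, H)
    gate_w:      (D, L) gate projection applied to last-layer features
    gate_b:      (L,)
    Returns fused features (T, H) and gate weights (T, L).
    """
    last = layer_feats[-1]                        # (T, D)
    gates = softmax(last @ gate_w + gate_b)       # (T, L), rows sum to 1
    # Each expert transforms the features of its own layer.
    expert_out = np.einsum("ltd,ldh->lth", layer_feats, expert_w)
    expert_out = np.maximum(expert_out + expert_b[:, None, :], 0.0)  # ReLU
    # Weighted sum over experts, frame by frame.
    fused = np.einsum("tl,lth->th", gates, expert_out)
    return fused, gates

# Stand-in data: random arrays replace real frozen wav2vec 2.0 outputs.
rng = np.random.default_rng(0)
L, T, D, H = 4, 10, 16, 8
layer_feats = rng.standard_normal((L, T, D))
expert_w = rng.standard_normal((L, D, H)) * 0.1
expert_b = np.zeros((L, H))
gate_w = rng.standard_normal((D, L)) * 0.1
gate_b = np.zeros(L)

fused, gates = moe_fuse(layer_feats, expert_w, expert_b, gate_w, gate_b)
print(fused.shape, gates.shape)
```

In a training setup along these lines, only the expert and gate parameters (plus the downstream classifier) would receive gradient updates, while the encoder that produces `layer_feats` stays fixed.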