I Can Hear You: Selective Robust Training for Deepfake Audio Detection

Zirui Zhang, Wei Hao, Aroon Sankoh, William Lin, Emanuel Mendiola-Ortiz, Junfeng Yang, Chengzhi Mao
2024-11-01
Abstract: Recent advances in AI-generated voices have intensified the challenge of detecting deepfake audio, posing risks for scams and the spread of disinformation. To tackle this issue, we establish the largest public voice dataset to date, named DeepFakeVox-HQ, comprising 1.3 million samples, including 270,000 high-quality deepfake samples from 14 diverse sources. Despite previously reported high accuracy, existing deepfake voice detectors struggle with our diversely collected dataset, and their detection success rates drop even further under realistic corruptions and adversarial attacks. We conduct a holistic investigation into factors that enhance model robustness and show that incorporating a diversified set of voice augmentations is beneficial. Moreover, we find that the best detection models often rely on high-frequency features, which are imperceptible to humans and can be easily manipulated by an attacker. To address this, we propose the F-SAT: Frequency-Selective Adversarial Training method focusing on high-frequency components. Empirical results demonstrate that using our training dataset boosts baseline model performance (without robust training) by 33%, and our robust training further improves accuracy by 7.7% on clean samples and by 29.3% on corrupted and attacked samples, over the state-of-the-art RawNet3 model.
Sound, Artificial Intelligence, Audio and Speech Processing
What problem does this paper attempt to address?
This paper addresses several key problems in current deepfake audio detection:

1. **Limitations of existing datasets**:
   - Existing deepfake audio datasets are usually small, lack diversity, are narrow in scope, and are outdated, so models trained on them generalize poorly to the varied deepfake audio found in the real world.
   - These datasets do not represent the latest AI voice synthesis technologies, which makes it difficult for models to handle newly emerging deepfake samples.
2. **Poor performance of models in real-world scenarios**:
   - Although some existing deepfake audio detection models report high accuracy on public datasets, their performance drops significantly in practice, especially under real-world noise and adversarial attacks.
   - The models' reliance on high-frequency features makes them vulnerable to small perturbations that are imperceptible to human hearing.
3. **Vulnerability to adversarial attacks**:
   - Deep learning models are particularly vulnerable to adversarial attacks in the audio domain. An attacker can cause misclassification by making slight modifications to the audio input that are imperceptible to human hearing.

To address these problems, the authors propose the following:

- **Construct a large-scale, diverse, and high-quality deepfake audio dataset**: The authors created DeepFakeVox-HQ, which contains 1.3 million samples, of which 270,000 are high-quality deepfake samples from 14 different sources. The dataset is large, covers multiple types of deepfake audio, and better reflects real-world conditions.
- **Propose the Frequency-Selective Adversarial Training (F-SAT) method**: To counter the detectors' reliance on high-frequency features, the authors introduce a new adversarial training method, F-SAT. It restricts adversarial training to specific bands in the frequency domain, making the model robust to perturbations of high-frequency features while leaving the genuine information carried by low-frequency features intact.
- **Enhance the robustness and generalization ability of the model**: With random audio augmentations and F-SAT applied during training, the model improves its accuracy on clean data and is more robust to various types of noise and adversarial attacks.

Overall, the paper aims to improve the performance and robustness of deepfake audio detection in real-world scenarios by combining a large-scale, high-quality dataset with a new adversarial training method, so as to better address the security risks posed by deepfake audio.
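The frequency-selective idea behind F-SAT can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the toy linear detector, the 4 to 8 kHz band, the step size `eps`, the peak normalization, and the function names `band_mask`, `band_limited_perturbation`, and `toy_loss_grad` are all assumptions made for illustration. The sketch only shows how an adversarial step could be confined to a chosen frequency band so that lower frequencies are left untouched.

```python
import numpy as np

def band_mask(n_samples, sample_rate, low_hz, high_hz):
    """Boolean mask over rfft bins that fall inside [low_hz, high_hz]."""
    freqs = np.fft.rfftfreq(n_samples, d=1.0 / sample_rate)
    return (freqs >= low_hz) & (freqs <= high_hz)

def band_limited_perturbation(grad, sample_rate, low_hz, high_hz, eps):
    """Project the loss gradient onto a frequency band, then scale it so
    its largest absolute sample value equals eps (assumed normalization)."""
    spectrum = np.fft.rfft(grad)
    # Zero every frequency bin outside the selected band.
    spectrum[~band_mask(grad.size, sample_rate, low_hz, high_hz)] = 0.0
    projected = np.fft.irfft(spectrum, n=grad.size)
    peak = np.max(np.abs(projected))
    if peak == 0.0:
        return np.zeros_like(grad)
    return eps * projected / peak

def toy_loss_grad(x, w, y):
    """Gradient of binary cross-entropy w.r.t. the waveform x for a
    hypothetical linear detector with weights w and label y."""
    p = 1.0 / (1.0 + np.exp(-np.dot(w, x)))
    return (p - y) * w

rng = np.random.default_rng(0)
sr = 16000                                # assumed sample rate in Hz
x = 0.1 * rng.standard_normal(sr)         # one second of stand-in audio
w = 0.01 * rng.standard_normal(sr)        # stand-in detector weights
eps = 1e-3                                # assumed perturbation budget

grad = toy_loss_grad(x, w, y=1.0)
delta = band_limited_perturbation(grad, sr, low_hz=4000, high_hz=8000, eps=eps)
x_adv = x + delta                         # adversarial example for training
```

In an adversarial training loop, `x_adv` would be fed back to the detector alongside the clean sample. Because the band mask acts as an orthogonal projection, the perturbation still points in a direction that increases the linearized loss while containing no energy below the lower band edge.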