One-Class Learning with Adaptive Centroid Shift for Audio Deepfake Detection

Hyun Myung Kim, Kangwook Jang, Hoirin Kim
2024-06-24
Abstract: As speech synthesis systems continue to make remarkable advances in recent years, the importance of robust deepfake detection systems that perform well in unseen systems has grown. In this paper, we propose a novel adaptive centroid shift (ACS) method that updates the centroid representation by continually shifting as the weighted average of bonafide representations. Our approach uses only bonafide samples to define their centroid, which can yield a specialized centroid for one-class learning. Integrating our ACS with one-class learning gathers bonafide representations into a single cluster, forming well-separated embeddings robust to unseen spoofing attacks. Our proposed method achieves an equal error rate (EER) of 2.19% on the ASVspoof 2021 deepfake dataset, outperforming all existing systems. Furthermore, the t-SNE visualization illustrates that our method effectively maps the bonafide embeddings into a single cluster and successfully disentangles the bonafide and spoof classes.
Audio and Speech Processing, Cryptography and Security, Sound
What problem does this paper attempt to address?
### Problems Addressed by the Paper

This paper addresses a central challenge in Audio Deepfake Detection (ADD): how to improve the generalization of detection systems to unseen synthesis systems. As speech synthesis advances, synthetic speech becomes harder to distinguish from real human speech, and attacks from unseen systems become harder to detect. Existing detection systems show a significant performance drop on such unseen attacks, which motivates a new approach to generalization.

### Specific Problems and Solutions

1. **Problem**: Existing audio deepfake detection methods perform poorly on unseen synthesis systems. The main reason is that these methods usually assume that all fake speech follows a similar distribution, whereas in reality the distribution of fake speech varies from one synthesis system to another.
2. **Solution**: The authors propose a new Adaptive Centroid Shift (ACS) method for One-Class Learning. The method updates the centroid representation using only bonafide samples, so fake samples have no influence on the centroid. Specifically, ACS continually shifts the centroid to the weighted average of bonafide representations, yielding a centroid specialized for the bonafide class. Combined with One-Class Learning, the method gathers bonafide samples into a single cluster and produces well-separated representations in the embedding space, which improves robustness against unseen fake attacks.

### Experimental Results

- **Datasets**: The paper conducts experiments on the ASVspoof 2021 and ASVspoof 2019 datasets.
- **Evaluation Metrics**: The main evaluation metrics are Equal Error Rate (EER) and Minimum Normalized Tandem Detection Cost Function (min t-DCF).
- **Performance**: The proposed ACS method achieves an EER of 2.19% on the ASVspoof 2021 deepfake dataset, outperforming all existing systems. In addition, t-SNE visualization shows that the method maps bonafide samples into a single cluster and separates bonafide from fake samples.

### Conclusion

By proposing the ACS method, this paper significantly improves the generalization ability of audio deepfake detection systems, especially against unseen fake attacks. The method achieves the best performance on multiple datasets and shows potential for practical applications.
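
### Illustrative Sketch of the Centroid Update

The centroid update described above (a weighted average built from bonafide samples only) can be illustrated with a small, self-contained Python sketch. This is not the authors' implementation. The class name `AdaptiveCentroid`, the count-based weighting, and the cosine-similarity score are all assumptions made for illustration; the paper's exact weighting scheme and loss function are not given in this summary.

```python
import math


class AdaptiveCentroid:
    """Running centroid of bonafide embeddings (hypothetical helper, not from the paper)."""

    def __init__(self, dim):
        self.centroid = [0.0] * dim
        self.count = 0  # number of bonafide embeddings seen so far

    def update(self, embeddings, is_bonafide):
        """Shift the centroid using only the bonafide embeddings in a batch."""
        bona = [e for e, keep in zip(embeddings, is_bonafide) if keep]
        if not bona:
            return self.centroid  # spoof-only batch: centroid is unchanged
        total = self.count + len(bona)
        batch_sum = [sum(col) for col in zip(*bona)]
        # Assumed weighting: the old centroid counts for `self.count` samples,
        # the new batch for `len(bona)` samples.
        self.centroid = [
            (self.count * c + s) / total
            for c, s in zip(self.centroid, batch_sum)
        ]
        self.count = total
        return self.centroid

    def score(self, embedding):
        """Cosine similarity to the centroid (assumed score; higher = more bonafide-like)."""
        dot = sum(a * b for a, b in zip(embedding, self.centroid))
        norm_e = math.sqrt(sum(a * a for a in embedding))
        norm_c = math.sqrt(sum(c * c for c in self.centroid))
        if norm_e == 0.0 or norm_c == 0.0:
            return 0.0
        return dot / (norm_e * norm_c)


# Toy 2-D embeddings; the spoof sample in the first batch is ignored.
acs = AdaptiveCentroid(dim=2)
acs.update([[1.0, 0.0], [0.0, 5.0]], [True, False])
acs.update([[3.0, 0.0], [5.0, 0.0]], [True, True])
print(acs.centroid)           # [3.0, 0.0]
print(acs.score([2.0, 0.0]))  # 1.0
```

In this toy run the centroid ends at the mean of the three bonafide points, and the spoof embedding never moves it. That is the property the summary attributes to ACS: the centroid is defined by bonafide data alone.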