Abstract: As speech synthesis systems have made remarkable advances in recent years, the importance of robust deepfake detection systems that perform well on unseen systems has grown. In this paper, we propose a novel adaptive centroid shift (ACS) method that continually updates the centroid representation as the weighted average of bonafide representations. Our approach uses only bonafide samples to define the centroid, which yields a centroid specialized for one-class learning. Integrating ACS with one-class learning gathers bonafide representations into a single cluster, forming well-separated embeddings that are robust to unseen spoofing attacks. Our proposed method achieves an equal error rate (EER) of 2.19% on the ASVspoof 2021 deepfake dataset, outperforming all existing systems. Furthermore, t-SNE visualization illustrates that our method effectively maps the bonafide embeddings into a single cluster and successfully disentangles the bonafide and spoof classes.
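The centroid update described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the class name `AdaptiveCentroid`, the label convention (1 for bonafide, 0 for spoof), the L2 normalisation, the equal per-sample weights, and the cosine-similarity score are all assumptions made for illustration.

```python
import numpy as np


class AdaptiveCentroid:
    """Running average of bonafide embeddings (illustrative sketch only)."""

    def __init__(self, dim):
        self.centroid = np.zeros(dim)
        self.count = 0  # number of bonafide embeddings seen so far

    def update(self, embeddings, labels):
        # Assumption: label 1 marks bonafide; spoof rows never touch the centroid.
        bona = embeddings[labels == 1]
        if len(bona) == 0:
            return self.centroid
        # Assumption: embeddings are L2-normalised before averaging.
        bona = bona / np.linalg.norm(bona, axis=1, keepdims=True)
        n = len(bona)
        total = self.count + n
        # Weighted average of the old centroid and the new batch mean,
        # weighted by how many bonafide samples each one represents.
        self.centroid = (self.count * self.centroid + n * bona.mean(axis=0)) / total
        self.count = total
        return self.centroid

    def score(self, embeddings):
        # Cosine similarity to the centroid; call update() at least once first.
        emb = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
        c = self.centroid / np.linalg.norm(self.centroid)
        return emb @ c
```

In this sketch the centroid shifts a little with every batch but is defined by bonafide samples alone, so a higher score means an embedding lies closer to the bonafide cluster. The loss that would pull bonafide embeddings toward the centroid during training is not shown.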

What problem does this paper attempt to address?

### Problems Addressed by the Paper

This paper aims to address a significant challenge in Audio Deepfake Detection (ADD), which is how to improve the generalization ability of detection systems when faced with unseen synthesis systems. With the rapid development of speech synthesis systems, synthetic speech is becoming increasingly difficult to distinguish from real human speech, making it more challenging to detect unseen fake attacks. Existing detection systems show a significant performance drop when encountering unseen fake attacks, thus necessitating a new approach to enhance the generalization ability of detection systems.

### Specific Problems and Solutions

1. **Problem**: Existing audio deepfake detection methods perform poorly when faced with unseen synthesis systems. The main reason is that these methods usually assume that fake speech has a similar distribution, whereas in reality, the distributions of fake speech generated by different synthesis systems vary.
2. **Solution**: The authors propose a new Adaptive Centroid Shift (ACS) method for One-Class Learning. This method updates the centroid representation using only bonafide samples, thereby avoiding the influence of fake samples on the centroid. Specifically, the ACS method continuously updates the centroid representation as a weighted average of bonafide samples, forming a centroid specifically for bonafide samples. Combined with One-Class Learning, this method clusters bonafide samples into a single cluster and forms clearly separated feature representations in the embedding space, thereby improving robustness against unseen fake attacks.

### Experimental Results

- **Datasets**: The paper conducts experiments on the ASVspoof 2021 and ASVspoof 2019 datasets.
- **Evaluation Metrics**: The main evaluation metrics used are Equal Error Rate (EER) and Minimum Normalized Tandem Detection Cost Function (min t-DCF).
- **Performance**: The proposed ACS method achieved an EER of 2.19% on the ASVspoof 2021 deepfake dataset, outperforming all existing systems. Additionally, t-SNE visualization results show that this method effectively maps bonafide samples into a single cluster and successfully separates bonafide and fake samples.

### Conclusion

By proposing the ACS method, this paper significantly improves the generalization ability of audio deepfake detection systems, especially when facing unseen fake attacks. This method not only achieves the best performance on multiple datasets but also demonstrates its potential in practical applications.
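For readers unfamiliar with the Equal Error Rate reported in the results, the following is a rough sketch of how it can be estimated from detection scores. The function name `equal_error_rate` and the convention that higher scores mean "more bonafide" are assumptions; official ASVspoof scoring tools may handle ties and interpolation differently, so values can differ slightly.

```python
import numpy as np


def equal_error_rate(bonafide_scores, spoof_scores):
    """Approximate EER, assuming higher scores mean more bonafide-like."""
    bonafide_scores = np.asarray(bonafide_scores, dtype=float)
    spoof_scores = np.asarray(spoof_scores, dtype=float)
    # Every observed score is tried as a decision threshold.
    thresholds = np.sort(np.concatenate([bonafide_scores, spoof_scores]))
    # False rejection rate: bonafide samples scored below the threshold.
    frr = np.array([(bonafide_scores < t).mean() for t in thresholds])
    # False acceptance rate: spoof samples scored at or above the threshold.
    far = np.array([(spoof_scores >= t).mean() for t in thresholds])
    # The EER is taken where the two error rates are closest to each other.
    idx = np.argmin(np.abs(far - frr))
    return (far[idx] + frr[idx]) / 2
```

An EER of 2.19% therefore means that, at the threshold where the two error types balance, roughly 2.19% of bonafide utterances are rejected and roughly 2.19% of spoofed utterances are accepted.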

One-Class Learning with Adaptive Centroid Shift for Audio Deepfake Detection

Securing Voice Biometrics: One-Shot Learning Approach for Audio Deepfake Detection

One-class Learning Towards Synthetic Voice Spoofing Detection

A Multi-Stream Fusion Approach with One-Class Learning for Audio-Visual Deepfake Detection

Advancing Continual Learning for Robust Deepfake Audio Classification

Transferring Audio Deepfake Detection Capability Across Languages

Audio Deepfake Detection with Self-Supervised WavLM and Multi-Fusion Attentive Classifier

Speaker Recognition-Assisted Robust Audio Deepfake Detection

Deepfake Audio Detection Using Spectrogram-based Feature and Ensemble of Deep Learning Models

An Efficient Temporary Deepfake Location Approach Based Embeddings for Partially Spoofed Audio Detection

I Can Hear You: Selective Robust Training for Deepfake Audio Detection

Efficient Deepfake Audio Detection Using Spectro-Temporal Analysis and Deep Learning

Self-Attention and Hybrid Features for Replay and Deep-Fake Audio Detection

SAMO: Speaker Attractor Multi-Center One-Class Learning for Voice Anti-Spoofing

HM-Conformer: A Conformer-based audio deepfake detection system with hierarchical pooling and multi-level classification token aggregation methods

Automatic speaker verification spoofing and deepfake detection using wav2vec 2.0 and data augmentation

A blended framework for audio spoof detection with sequential models and bags of auditory bites

MelCochleaGram-DeepCNN: Sequentially Fused Spectrogram and the DeepCNN Classifiers-based Audio Spoof Detection System

Straight Through Gumbel Softmax Estimator based Bimodal Neural Architecture Search for Audio-Visual Deepfake Detection

Audio Deepfake Attribution: An Initial Dataset and Investigation

Audio-deepfake detection: Adversarial attacks and countermeasures