Abstract: Text-to-Speech (TTS) and Voice Conversion (VC) models have exhibited remarkable performance in generating realistic and natural audio. However, their dark side, the audio deepfake, poses a significant threat to both society and individuals. Existing countermeasures largely focus on determining the genuineness of speech based on complete original audio recordings, which often contain private content. This oversight may keep deepfake detection out of many applications, particularly in scenarios involving sensitive information like business secrets. In this paper, we propose SafeEar, a novel framework that aims to detect deepfake audio without relying on access to the speech content within. Our key idea is to build a neural audio codec into a novel decoupling model that cleanly separates the semantic and acoustic information in audio samples, and to use only the acoustic information (e.g., prosody and timbre) for deepfake detection. In this way, no semantic content is exposed to the detector. To overcome the challenge of identifying diverse deepfake audio without semantic clues, we enhance our deepfake detector with real-world codec augmentation. Extensive experiments conducted on four benchmark datasets demonstrate SafeEar's effectiveness in detecting various deepfake techniques, with an equal error rate (EER) down to 2.02%. At the same time, it shields speech content in five languages from being deciphered by both machine and human auditory analysis, as demonstrated by word error rates (WERs) all above 93.93% and by our user study. Furthermore, our benchmark constructed for anti-deepfake and anti-content-recovery evaluation provides a basis for future research on audio privacy preservation and deepfake detection.
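
The two metrics quoted in the abstract can be made concrete with a small sketch. The code below is not the authors' evaluation code; it is a minimal illustration that assumes a higher detector score means "more likely deepfake", that label 1 marks a deepfake and 0 a genuine recording, and that WER is computed on whitespace-separated words. The function names `equal_error_rate` and `word_error_rate` are hypothetical.

```python
import numpy as np


def equal_error_rate(scores, labels):
    """Approximate the EER: the error rate at the threshold where the
    false positive rate and the false negative rate are closest."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    fake_scores = scores[labels == 1]   # assumed convention: 1 = deepfake
    real_scores = scores[labels == 0]   # assumed convention: 0 = genuine
    if fake_scores.size == 0 or real_scores.size == 0:
        raise ValueError("both classes must be present")

    best_gap, eer = np.inf, 1.0
    # Try every observed score as a threshold, plus +inf (reject everything).
    for threshold in np.append(np.unique(scores), np.inf):
        false_positive_rate = np.mean(real_scores >= threshold)
        false_negative_rate = np.mean(fake_scores < threshold)
        gap = abs(false_positive_rate - false_negative_rate)
        if gap < best_gap:
            best_gap = gap
            eer = (false_positive_rate + false_negative_rate) / 2
    return float(eer)


def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Rolling-row Levenshtein distance over words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitution))
        prev = curr
    return prev[-1] / len(ref)
```

A low EER means the detector separates genuine and fake audio well, while a WER near or above 100% means a recognizer recovers almost none of the spoken words. Note that WER can exceed 1.0 when the hypothesis contains many inserted words.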


### What problem does this paper attempt to solve?

This paper aims to solve the privacy protection problem in audio deepfake detection. Specifically, existing audio deepfake detection methods usually rely on complete original audio recordings, which may contain private content and are thus restricted in many application scenarios (such as those involving commercial secrets). To solve this problem, the authors propose a new framework named SafeEar, which can detect audio deepfakes without accessing the voice content, thereby protecting user privacy.

### What are the main contributions of the paper?

1. **First attempt**: As far as the authors know, this is the first attempt to study and verify the feasibility of achieving audio deepfake detection while protecting the privacy of voice content.
2. **Proposing the SafeEar framework**: SafeEar is an innovative privacy-protecting deepfake detection framework. It decomposes voice information into semantic and acoustic information through a neural audio codec and uses only acoustic information for detection, ensuring content privacy. In addition, the authors have developed an advanced detector that achieves effective deepfake detection based only on acoustic information.
3. **Constructing the CVoiceFake dataset**: The authors constructed a multilingual deepfake dataset, CVoiceFake, covering more than 1.25 million real and fake voice samples, and established a comprehensive benchmark test, focusing on deepfake detection and content privacy protection tasks.

Experimental results show that SafeEar can effectively detect deepfake audio under various influencing factors and resist various content restoration attacks.

### How does SafeEar work?

The core idea of SafeEar is to decompose the voice signal into semantic information and acoustic information and use only acoustic information for deepfake detection. The specific steps are as follows:

1. **Front-end feature extraction**: Use a neural audio codec to decompose the audio signal \( X\in\mathbb{R}^{1\times T} \) into semantic tokens \( S\in\mathbb{R}^{C\times T_{n}} \) and acoustic tokens \( A\in\mathbb{R}^{7C\times T_{n}} \), where \( C \) represents the token dimension, and \( T \) and \( T_{n} \) represent the audio length and token length respectively.
2. **Bottleneck and shuffling layers**: Further protect the acoustic tokens through the bottleneck and shuffling layers so that they cannot be reconstructed into the original content.
3. **Back-end detector optimization**: The back-end detector is carefully tuned, including the optimal number of self-attention heads and training to simulate real-world codec conversions, to ensure reliable detection of various real-world deepfake audio.
4. **Codec-based decoupling model (CDM)**: Use multi-layer residual vector quantizers (RVQs) to separate the mixed voice tokens into independent semantic and acoustic tokens. The encoder-decoder architecture accurately reconstructs the original audio, and the RVQs equipped with HuBERT further decouple these features and hierarchically quantize them into discrete semantic and acoustic tokens.

Through these designs, SafeEar can not only effectively detect deepfake audio, but also prevent content restoration by machine and human auditory analysis, thereby protecting user privacy.
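
The token shapes and the shuffling step described above can be illustrated with a toy NumPy sketch. This is not SafeEar's implementation, and it rests on several assumptions: eight RVQ layers (inferred from the \( C \) and \( 7C \) shapes), the first layer standing in for the semantic tokens, random codebooks instead of trained ones, no HuBERT supervision, no bottleneck layer, and a plain random permutation along the time axis as the shuffle. The names `residual_vector_quantize` and `split_and_shuffle` are hypothetical.

```python
import numpy as np


def residual_vector_quantize(frames, codebooks):
    """Quantize frames of shape (T_n, C) with a stack of codebooks.

    Each layer quantizes what the previous layers left unexplained.
    Returns one quantized (T_n, C) array per layer.
    """
    residual = frames.copy()
    layers = []
    for codebook in codebooks:  # each codebook has shape (K, C)
        # Squared distance from every frame to every code vector: (T_n, K).
        distances = ((residual[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
        nearest = distances.argmin(axis=1)
        quantized = codebook[nearest]
        layers.append(quantized)
        residual = residual - quantized
    return layers


def split_and_shuffle(layers, rng):
    """Treat layer 0 as semantic tokens and the rest as acoustic tokens,
    then permute the acoustic tokens along the time axis."""
    semantic = layers[0].T                                         # (C, T_n)
    acoustic = np.concatenate([q.T for q in layers[1:]], axis=0)   # (7C, T_n)
    order = rng.permutation(acoustic.shape[1])
    return semantic, acoustic[:, order]


# Toy dimensions, chosen only for illustration.
rng = np.random.default_rng(0)
C, T_n, K, num_layers = 4, 10, 16, 8
frames = rng.normal(size=(T_n, C))
codebooks = [rng.normal(size=(K, C)) for _ in range(num_layers)]

layers = residual_vector_quantize(frames, codebooks)
semantic, shuffled_acoustic = split_and_shuffle(layers, rng)
print(semantic.shape, shuffled_acoustic.shape)  # (4, 10) (28, 10)
```

In this sketch only `shuffled_acoustic` would be passed to the detector. The semantic array is computed but never leaves the front end, which mirrors the idea that no speech content is exposed to the detector.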

SafeEar: Content Privacy-Preserving Audio Deepfake Detection
