SafeEar: Content Privacy-Preserving Audio Deepfake Detection

Xinfeng Li, Kai Li, Yifan Zheng, Chen Yan, Xiaoyu Ji, Wenyuan Xu
2024-09-14
Abstract: Text-to-Speech (TTS) and Voice Conversion (VC) models have exhibited remarkable performance in generating realistic and natural audio. However, their dark side, audio deepfakes, poses a significant threat to both society and individuals. Existing countermeasures largely focus on determining the genuineness of speech based on complete original audio recordings, which often contain private content. This oversight may keep deepfake detection out of many applications, particularly in scenarios involving sensitive information such as business secrets. In this paper, we propose SafeEar, a novel framework that aims to detect deepfake audio without accessing the speech content within. Our key idea is to turn a neural audio codec into a novel decoupling model that cleanly separates the semantic and acoustic information in audio samples, and to use only the acoustic information (e.g., prosody and timbre) for deepfake detection. In this way, no semantic content is exposed to the detector. To overcome the challenge of identifying diverse deepfake audio without semantic clues, we enhance our deepfake detector with real-world codec augmentation. Extensive experiments conducted on four benchmark datasets demonstrate SafeEar's effectiveness in detecting various deepfake techniques, with an equal error rate (EER) down to 2.02%. Simultaneously, it shields five-language speech content from being deciphered by both machine and human auditory analysis, as demonstrated by word error rates (WERs) all above 93.93% and by our user study. Furthermore, our benchmark constructed for anti-deepfake and anti-content-recovery evaluation provides a basis for future research on audio privacy preservation and deepfake detection.
Cryptography and Security, Artificial Intelligence, Multimedia, Sound, Audio and Speech Processing
### What problem does this paper attempt to solve?

This paper addresses privacy protection in audio deepfake detection. Existing detection methods usually rely on complete original audio recordings, which may contain private content, so they cannot be used in many application scenarios (such as those involving commercial secrets). To solve this problem, the authors propose a new framework named SafeEar, which detects audio deepfakes without accessing the speech content, thereby protecting user privacy.

### What are the main contributions of the paper?

1. **First attempt**: To the authors' knowledge, this is the first attempt to study and verify the feasibility of audio deepfake detection that also protects the privacy of speech content.
2. **Proposing the SafeEar framework**: SafeEar is a privacy-preserving deepfake detection framework. It decomposes speech into semantic and acoustic information through a neural audio codec and uses only the acoustic information for detection, ensuring content privacy. The authors also develop a detector that achieves effective deepfake detection from acoustic information alone.
3. **Constructing the CVoiceFake dataset**: The authors constructed a multilingual deepfake dataset, CVoiceFake, covering more than 1.25 million real and fake voice samples, and established a comprehensive benchmark for deepfake detection and content privacy protection. Experimental results show that SafeEar effectively detects deepfake audio under various influencing factors and resists various content recovery attacks.

### How does SafeEar work?

The core idea of SafeEar is to decompose the speech signal into semantic information and acoustic information and to use only the acoustic information for deepfake detection. The specific steps are as follows:

1. **Front-end feature extraction**: Use a neural audio codec to decompose the audio signal \( X\in\mathbb{R}^{1\times T} \) into semantic tokens \( S\in\mathbb{R}^{C\times T_{n}} \) and acoustic tokens \( A\in\mathbb{R}^{7C\times T_{n}} \), where \( C \) is the token dimension, and \( T \) and \( T_{n} \) are the audio length and token length, respectively.
2. **Bottleneck and shuffling layers**: Further protect the acoustic tokens with bottleneck and shuffling layers so that the original content cannot be reconstructed from them.
3. **Back-end detector optimization**: The back-end detector is carefully tuned, including the choice of the number of self-attention heads and training that simulates real-world codec conversions, to ensure reliable detection of diverse real-world deepfake audio.
4. **Codec-based decoupling model (CDM)**: Use multi-layer residual vector quantizers (RVQs) to separate the mixed speech tokens into independent semantic and acoustic tokens. The encoder-decoder architecture accurately reconstructs the original audio, and the RVQs, equipped with HuBERT, further decouple these features and hierarchically quantize them into discrete semantic and acoustic tokens.

Through these designs, SafeEar not only detects deepfake audio effectively but also prevents content recovery by machine and human auditory analysis, thereby protecting user privacy.
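
The token shapes in the front-end step and the idea behind the shuffling layer can be illustrated with a small, dependency-free Python sketch. It rests on assumptions the summary does not state: the codec is taken to produce 8 quantizer layers, with the first treated as semantic and the other 7 stacked along the channel axis (which would give the \( 7C \) acoustic dimension), and shuffling is modeled as a random permutation of time frames inside fixed-size windows. The function names, the window size, and the random stand-in features are all hypothetical. The bottleneck layer and the detector are not modeled.

```python
import random

NUM_LAYERS = 8  # assumption: 1 semantic layer + 7 acoustic layers


def split_tokens(layers):
    """Split per-layer features (each C rows x T_n columns) into S and A.

    S is the first layer (C x T_n); A stacks the remaining layers along
    the channel axis, giving 7C x T_n when there are 8 layers.
    """
    semantic = layers[0]
    acoustic = [row for layer in layers[1:] for row in layer]
    return semantic, acoustic


def shuffle_frames(tokens, window, rng):
    """Permute time frames inside consecutive windows of `window` frames.

    The same permutation is applied to every channel, so each frame keeps
    its full feature vector but loses its position within the window.
    """
    t_n = len(tokens[0])
    order = []
    for start in range(0, t_n, window):
        idx = list(range(start, min(start + window, t_n)))
        rng.shuffle(idx)
        order.extend(idx)
    shuffled = [[row[i] for i in order] for row in tokens]
    return shuffled, order


# Toy dimensions: token dimension C and token length T_n.
C, T_N = 4, 10
rng = random.Random(0)

# Random stand-in for codec output: NUM_LAYERS matrices of shape C x T_n.
layers = [
    [[rng.random() for _ in range(T_N)] for _ in range(C)]
    for _ in range(NUM_LAYERS)
]

S, A = split_tokens(layers)
A_shuffled, order = shuffle_frames(A, window=5, rng=rng)

print(len(S), len(S[0]))                    # C rows, T_n columns
print(len(A_shuffled), len(A_shuffled[0]))  # 7C rows, T_n columns
```

In this sketch only `A_shuffled` would be passed on to a detector. Frame-level acoustic features survive intact, while the temporal order that a speech recognizer depends on is disturbed within each window.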