Abstract: Text-to-Speech (TTS) and Voice Conversion (VC) models have exhibited remarkable performance in generating realistic and natural audio. However, their dark side, audio deepfakes, poses a significant threat to both society and individuals. Existing countermeasures largely focus on determining the genuineness of speech from complete original audio recordings, which often contain private content. This oversight may keep deepfake detection out of many applications, particularly in scenarios involving sensitive information such as business secrets. In this paper, we propose SafeEar, a novel framework that aims to detect deepfake audio without relying on access to the speech content within. Our key idea is to adapt a neural audio codec into a novel decoupling model that cleanly separates the semantic and acoustic information in audio samples, and to use only the acoustic information (e.g., prosody and timbre) for deepfake detection. In this way, no semantic content is exposed to the detector. To overcome the challenge of identifying diverse deepfake audio without semantic clues, we enhance our deepfake detector with real-world codec augmentation. Extensive experiments on four benchmark datasets demonstrate SafeEar's effectiveness in detecting various deepfake techniques, with an equal error rate (EER) as low as 2.02%. At the same time, it shields speech content in five languages from being deciphered by both machine and human auditory analysis, as demonstrated by word error rates (WERs) all above 93.93% and by our user study. Furthermore, the benchmark we construct for anti-deepfake and anti-content-recovery evaluation provides a basis for future research on audio privacy preservation and deepfake detection.
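
The two metrics quoted in the abstract can be illustrated with a minimal sketch. This is not the paper's evaluation code: the function names `equal_error_rate` and `word_error_rate`, the convention that a higher score means "more likely bona fide", and the simple threshold sweep are all assumptions made for illustration. A low EER indicates good deepfake detection, while a high WER on recovered transcripts indicates that the speech content is hard to decipher.

```python
def equal_error_rate(bonafide_scores, spoof_scores):
    """Approximate the EER by sweeping every observed score as a threshold.

    Assumption of this sketch: a higher score means "more likely bona fide".
    """
    thresholds = sorted(set(bonafide_scores) | set(spoof_scores))
    best_gap, eer = float("inf"), None
    for t in thresholds:
        # False acceptance: spoofed audio scored at or above the threshold.
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        # False rejection: bona fide audio scored below the threshold.
        frr = sum(s < t for s in bonafide_scores) / len(bonafide_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must not be empty")
    # Levenshtein distance over words, keeping one previous row in memory.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Toy scores, invented for illustration only.
    print(equal_error_rate([0.9, 0.8, 0.7, 0.4], [0.6, 0.3, 0.2, 0.1]))
    print(word_error_rate("the cat sat on the mat", "the dog sat on mat"))
```

Because insertions also count as errors, a WER computed this way can exceed 100% when the hypothesis is much longer than the reference.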

Multi-Scale Permutation Entropy for Audio Deepfake Detection

Ghost-in-Wave: How Speaker-Irrelative Features Interfere DeepFake Voice Detectors

Audio Deepfake Detection with Self-Supervised WavLM and Multi-Fusion Attentive Classifier

SafeEar: Content Privacy-Preserving Audio Deepfake Detection

MelCochleaGram-DeepCNN: Sequentially Fused Spectrogram and the DeepCNN Classifiers-based Audio Spoof Detection System

Self-Attention and Hybrid Features for Replay and Deep-Fake Audio Detection

A blended framework for audio spoof detection with sequential models and bags of auditory bites

Transferring Audio Deepfake Detection Capability Across Languages

Speaker Recognition-Assisted Robust Audio Deepfake Detection

A lightweight feature extraction technique for deepfake audio detection

A Comparative Study on Physical and Perceptual Features for Deepfake Audio Detection

Acoustic features analysis for explainable machine learning-based audio spoofing detection

MFAAN: Unveiling Audio Deepfakes with a Multi-Feature Authenticity Network

Audio Deepfake Detection Based on a Combination of F0 Information and Real Plus Imaginary Spectrogram Features

Deepfake Audio Detection Using Spectrogram-based Feature and Ensemble of Deep Learning Models

Advancing Continual Learning for Robust Deepfake Audio Classification

A robust audio deepfake detection system via multi-view feature

Efficient Deepfake Audio Detection Using Spectro-Temporal Analysis and Deep Learning

Audio-deepfake detection: Adversarial attacks and countermeasures

Heterogeneity over Homogeneity: Investigating Multilingual Speech Pre-Trained Models for Detecting Audio Deepfake

Securing Voice Biometrics: One-Shot Learning Approach for Audio Deepfake Detection