Training-Free Deepfake Voice Recognition by Leveraging Large-Scale Pre-Trained Models

Alessandro Pianese, Davide Cozzolino, Giovanni Poggi, Luisa Verdoliva

DOI: https://doi.org/10.1145/3658664.3659662

2024-07-01

Abstract: Generalization is a main issue for current audio deepfake detectors, which struggle to provide reliable results on out-of-distribution data. Given the speed at which more and more accurate synthesis methods are developed, it is very important to design techniques that work well also on data they were not trained for. In this paper we study the potential of large-scale pre-trained models for audio deepfake detection, with special focus on generalization ability. To this end, the detection problem is reformulated in a speaker verification framework and fake audios are exposed by the mismatch between the voice sample under test and the voice of the claimed identity. With this paradigm, no fake speech sample is necessary in training, cutting off any link with the generation method at the root, and ensuring full generalization ability. Features are extracted by general-purpose large pre-trained models, with no need for training or fine-tuning on specific fake detection or speaker verification datasets. At detection time only a limited set of voice fragments of the identity under test is required. Experiments on several datasets widespread in the community show that detectors based on pre-trained models achieve excellent performance and show strong generalization ability, rivaling supervised methods on in-distribution data and largely overcoming them on out-of-distribution data.

Sound, Computer Vision and Pattern Recognition, Audio and Speech Processing

What problem does this paper attempt to address?

The paper addresses the generalization problem in audio deepfake detection: current detectors struggle to give reliable results on out-of-distribution data, that is, on audio produced by synthesis methods they were not trained on. Because new and more accurate synthesis methods appear quickly, a detector must also work on data it has never seen. The paper studies large-scale pre-trained models for this task, with a special focus on generalization. It reframes detection as a speaker verification problem: a sample is flagged as fake when the voice under test does not match the voice of the claimed identity. Because no fake speech samples are used in training, the detector has no link to any particular generation method, which the authors argue ensures full generalization. Features are extracted by general-purpose large pre-trained models, with no training or fine-tuning on fake detection or speaker verification datasets, and at detection time only a limited set of voice fragments of the claimed identity is required. Experiments on several widely used datasets show that detectors based on pre-trained models rival supervised methods on in-distribution data and largely outperform them on out-of-distribution data.
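The verification-style decision described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the authors' implementation: the embeddings are assumed to come from some pre-trained audio model (no specific model is named here, and the toy vectors below are made up), and the choice of cosine similarity, the max-over-references scoring rule, the threshold value, and the function names `cosine_similarity`, `identity_score`, and `is_fake` are all hypothetical.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity between two embedding vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def identity_score(test_embedding: Vector,
                   reference_embeddings: Sequence[Vector]) -> float:
    """Best match between the test sample and the claimed identity's references.

    Assumption: taking the maximum over references is one plausible
    scoring rule; the source text does not specify the rule used.
    """
    if not reference_embeddings:
        raise ValueError("at least one reference embedding is required")
    return max(cosine_similarity(test_embedding, ref)
               for ref in reference_embeddings)


def is_fake(test_embedding: Vector,
            reference_embeddings: Sequence[Vector],
            threshold: float = 0.5) -> bool:
    """Flag the sample as fake when it does not match the claimed identity.

    Assumption: the threshold 0.5 is arbitrary and chosen for the toy data.
    """
    return identity_score(test_embedding, reference_embeddings) < threshold


# Toy embeddings standing in for pre-trained model features (made-up values).
references = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0]]   # claimed identity
matching_sample = [0.95, 0.05, 0.0]               # close to the references
mismatched_sample = [0.0, 1.0, 0.0]               # far from the references

print(is_fake(matching_sample, references))
print(is_fake(mismatched_sample, references))
```

Note that nothing in this sketch is trained on fake speech: the decision depends only on reference recordings of the claimed identity, which is the property the summary credits for the method's generalization.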

Deepfake audio detection by speaker verification

Transferring Audio Deepfake Detection Capability Across Languages

Ghost-in-Wave: How Speaker-Irrelative Features Interfere DeepFake Voice Detectors

Does Audio Deepfake Detection Generalize?

Towards generalizing deep-audio fake detection networks

Voice-Face Homogeneity Tells Deepfake

Towards generalisable and calibrated synthetic speech detection with self-supervised representations

Deepfake Detection without Deepfakes: Generalization via Synthetic Frequency Patterns Injection

I Can Hear You: Selective Robust Training for Deepfake Audio Detection

Enhancing Generalization in Audio Deepfake Detection: A Neural Collapse based Sampling and Training Approach

Combining Automatic Speaker Verification and Prosody Analysis for Synthetic Speech Detection

How Generalizable are Deepfake Image Detectors? An Empirical Study

A robust audio deepfake detection system via multi-view feature

Speaker Recognition-Assisted Robust Audio Deepfake Detection

FakeSound: Deepfake General Audio Detection

Audio-Visual Contrastive Pre-train for Face Forgery Detection

Generalized Fake Audio Detection via Deep Stable Learning

Leveraging Mixture of Experts for Improved Speech Deepfake Detection

Efficient Deepfake Audio Detection Using Spectro-Temporal Analysis and Deep Learning

Learning A Self-Supervised Domain-Invariant Feature Representation for Generalized Audio Deepfake Detection