Abstract: Audio deepfake detection (ADD) is crucial to combat the misuse of speech synthesized from generative AI models. Existing ADD models suffer from generalization issues, with a large performance discrepancy between in-domain and out-of-domain data. Moreover, the black-box nature of existing models limits their use in real-world scenarios, where explanations are required for model decisions. To alleviate these issues, we introduce a new ADD model that explicitly uses the StyleLInguistics Mismatch (SLIM) in fake speech to separate them from real speech. SLIM first employs self-supervised pretraining on only real samples to learn the style-linguistics dependency in the real class. The learned features are then used in complement with standard pretrained acoustic features (e.g., Wav2vec) to learn a classifier on the real and fake classes. When the feature encoders are frozen, SLIM outperforms benchmark methods on out-of-domain datasets while achieving competitive results on in-domain data. The features learned by SLIM allow us to quantify the (mis)match between style and linguistic content in a sample, hence facilitating an explanation of the model decision.

What problem does this paper attempt to address?

This paper attempts to solve two main problems in Audio Deepfake Detection (ADD): generalization ability and model interpretability.

### Generalization Ability Problem

The performance of existing ADD models drops significantly on unseen attacks (out-of-domain data). Existing models perform well on data drawn from the same distribution as their training set but poorly on unseen datasets, which leaves a large performance gap between in-domain and out-of-domain data.

### Model Interpretability Problem

Most existing ADD models are black boxes whose decision-making processes are difficult to explain, a major drawback in real-world scenarios where explanations are required. To improve interpretability, researchers need to understand what information the model relies on when making predictions and in which cases it may fail to detect deepfake audio. However, existing post-hoc explanation methods based on visualization (e.g., saliency maps) are sensitive to training settings and give inconsistent results.

### Proposed Solution

To address these problems, the paper proposes a new ADD model, SLIM (Style-Linguistics Mismatch model), which distinguishes real speech from fake speech by explicitly exploiting the style-linguistics mismatch in fake speech. The main contributions of SLIM include:

1. **Proposing the SLIM model**: The model uses the style-linguistics mismatch in deepfake audio for generalizable detection and is interpretable.
2. **Superior performance on out-of-domain datasets**: SLIM performs better than existing state-of-the-art methods on out-of-domain datasets (such as In-the-Wild and MLAAD) and is competitive on in-domain datasets (such as ASVspoof2019 and ASVspoof2021).
3. **Enhanced interpretability**: Unlike black-box ADD models, the style-linguistics features learned by SLIM can be used to explain the model's decisions. The paper shows how to produce explanations at both the group level and the individual-sample level.

### Method Overview

SLIM adopts a two-stage framework:

- **First stage**: Self-supervised pretraining is performed using only real samples to learn style-linguistics dependencies. Dependency features are generated by comparing the style and linguistics subspace representations.
- **Second stage**: The dependency features learned in the first stage are combined with the original style and linguistics representations, and a lightweight projection head is trained with supervision to classify the input representation as real or fake.

### Experimental Results

The experimental results show that SLIM outperforms existing methods on out-of-domain datasets while remaining competitive on in-domain datasets. In addition, the dependency features generalize better in separating real from fake samples and help explain the model's decisions.

### Summary

By introducing the concept of style-linguistics mismatch, SLIM improves both the generalization ability and the interpretability of audio deepfake detection, making it better suited to real-world conditions.
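As a rough illustration of the two-stage framework in the Method Overview, the sketch below shows how a style-linguistics mismatch could be scored and how dependency features could be combined with the original representations before a projection head. It is a minimal sketch under stated assumptions, not the paper's implementation: the summary does not specify the mismatch measure, so cosine distance is assumed here, and the function names (`mismatch_score`, `classifier_input`, `projection_head`), the toy vectors, and the single-layer logistic head are all hypothetical.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)


def mismatch_score(style_dep, ling_dep):
    """Stage-1 output (assumed form): distance between the style and
    linguistics dependency features. Larger means a stronger mismatch."""
    return cosine_distance(style_dep, ling_dep)


def classifier_input(style_dep, ling_dep, style_raw, ling_raw):
    """Stage-2 input: dependency features concatenated with the original
    style and linguistics representations (concatenation is an assumption)."""
    return list(style_dep) + list(ling_dep) + list(style_raw) + list(ling_raw)


def projection_head(features, weights, bias):
    """Hypothetical lightweight head: one linear layer plus a sigmoid,
    returning a score in (0, 1) for the 'fake' class."""
    if len(features) != len(weights):
        raise ValueError("features and weights must have the same length")
    logit = sum(w * x for w, x in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


if __name__ == "__main__":
    # Toy vectors standing in for encoder outputs; real features would come
    # from frozen pretrained encoders.
    style_dep, ling_dep = [1.0, 0.0], [0.0, 1.0]
    style_raw, ling_raw = [0.2, 0.4], [0.1, 0.3]

    print("mismatch:", mismatch_score(style_dep, ling_dep))
    x = classifier_input(style_dep, ling_dep, style_raw, ling_raw)
    # Untrained placeholder weights, used only to show the data flow.
    print("fake score:", projection_head(x, [0.0] * len(x), 0.0))
```

In this sketch the mismatch score doubles as the interpretable quantity: a sample whose style and linguistics dependency features point in different directions gets a score near 1, while aligned features score near 0.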

SLIM: Style-Linguistics Mismatch Model for Generalized Audio Deepfake Detection

Ghost-in-Wave: How Speaker-Irrelative Features Interfere DeepFake Voice Detectors

A robust audio deepfake detection system via multi-view feature

Transferring Audio Deepfake Detection Capability Across Languages

Learn from Real: Reality Defender's Submission to ASVspoof5 Challenge

Cross-Domain Audio Deepfake Detection: Dataset and Analysis

Efficient Deepfake Audio Detection Using Spectro-Temporal Analysis and Deep Learning

FakeSound: Deepfake General Audio Detection

TranssionADD: A multi-frame reinforcement based sequence tagging model for audio deepfake detection

The Codecfake Dataset and Countermeasures for the Universally Detection of Deepfake Audio

Codecfake: An Initial Dataset for Detecting LLM-based Deepfake Audio

Speaker Recognition-Assisted Robust Audio Deepfake Detection

DFADD: The Diffusion and Flow-Matching Based Audio Deepfake Dataset

A Comparative Study on Physical and Perceptual Features for Deepfake Audio Detection

Audio Deepfake Detection with Self-Supervised WavLM and Multi-Fusion Attentive Classifier

ADD 2023: Towards Audio Deepfake Detection and Analysis in the Wild

Learning A Self-Supervised Domain-Invariant Feature Representation for Generalized Audio Deepfake Detection

Heterogeneity over Homogeneity: Investigating Multilingual Speech Pre-Trained Models for Detecting Audio Deepfake

DiMoDif: Discourse Modality-information Differentiation for Audio-visual Deepfake Detection and Localization

Generalized Fake Audio Detection via Deep Stable Learning

A Unified Framework for Modality-Agnostic Deepfakes Detection