SLIM: Style-Linguistics Mismatch Model for Generalized Audio Deepfake Detection

Yi Zhu, Surya Koppisetti, Trang Tran, Gaurav Bharaj
2024-07-26
Abstract: Audio deepfake detection (ADD) is crucial to combat the misuse of speech synthesized from generative AI models. Existing ADD models suffer from generalization issues, with a large performance discrepancy between in-domain and out-of-domain data. Moreover, the black-box nature of existing models limits their use in real-world scenarios, where explanations are required for model decisions. To alleviate these issues, we introduce a new ADD model that explicitly uses the Style-LInguistics Mismatch (SLIM) in fake speech to separate it from real speech. SLIM first employs self-supervised pretraining on only real samples to learn the style-linguistics dependency in the real class. The learned features are then used in complement with standard pretrained acoustic features (e.g., Wav2vec) to learn a classifier on the real and fake classes. When the feature encoders are frozen, SLIM outperforms benchmark methods on out-of-domain datasets while achieving competitive results on in-domain data. The features learned by SLIM allow us to quantify the (mis)match between style and linguistic content in a sample, hence facilitating an explanation of the model decision.
Sound, Artificial Intelligence, Audio and Speech Processing
What problem does this paper attempt to address?
This paper addresses two main problems in audio deepfake detection (ADD): generalization and model interpretability.

### Generalization Problem

The performance of existing ADD models drops significantly on unseen attacks (out-of-domain data). Models perform well on data drawn from the same distribution as their training set but poorly on unseen datasets, which leaves a large gap between in-domain and out-of-domain performance.

### Model Interpretability Problem

Most existing models are black boxes whose decisions are difficult to explain, which is a major drawback in real-world scenarios where explanations are required. To improve interpretability, researchers need to understand what information a model relies on when making predictions and in which cases it may fail to detect deepfake audio. However, existing explanation methods such as visualization-based post-hoc analysis (e.g., saliency maps) are sensitive to training settings and give inconsistent results.

### Proposed Solution

To address these problems, the paper proposes a new ADD model, SLIM (Style-Linguistics Mismatch Model), which distinguishes real speech from fake speech by explicitly exploiting the style-linguistics mismatch in fake speech. The main contributions are:

1. **Proposing the SLIM model**: The model uses the style-linguistics mismatch in deepfake audio for generalizable detection and is interpretable.
2. **Superior performance on out-of-domain datasets**: With frozen feature encoders, SLIM outperforms benchmark methods on out-of-domain datasets (such as In-the-wild and MLAAD) and is competitive on in-domain datasets (such as ASVspoof2019 and ASVspoof2021).
3. **Enhanced interpretability**: Unlike black-box ADD models, the style-linguistics features learned by SLIM can be used to explain the model's decisions. The paper shows how to produce explanations at both the group level and the individual-sample level.

### Method Overview

SLIM adopts a two-stage framework:

- **First stage**: Self-supervised pretraining is performed using only real samples to learn style-linguistics dependencies. Dependency features are generated by comparing the style and linguistics subspace representations.
- **Second stage**: The dependency features learned in the first stage are combined with the original style and linguistics representations, and a lightweight projection head is trained with supervision to classify the input as real or fake.

### Experimental Results

The experiments show that SLIM outperforms existing methods on out-of-domain datasets while remaining competitive on in-domain data. In addition, the dependency features generalize better when distinguishing real from fake samples and help to explain the model's decisions.

### Summary

By introducing the concept of style-linguistics mismatch, SLIM improves both the generalization of audio deepfake detection and the interpretability of the model, making it better suited to real-world use.
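
The two-stage framework described under Method Overview can be illustrated with a small synthetic sketch. Everything in it is an assumption made for illustration rather than the paper's implementation: random vectors stand in for encoder outputs, a ridge regression stands in for the self-supervised first stage (the summary does not specify the actual objective), a logistic regression stands in for the lightweight projection head, and all names (`fit_dependency`, `mismatch_score`, `train_head`, `DIM`) are hypothetical. The toy data is built so that style depends on linguistic content only in the real class.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 16  # hypothetical embedding size for both subspaces

# Toy generator (assumption): in real speech, style depends on linguistic
# content through a fixed map W; in fake speech, style is independent of it.
W, _ = np.linalg.qr(rng.normal(size=(DIM, DIM)))

def make_samples(n, matched):
    ling = rng.normal(size=(n, DIM))
    source = ling if matched else rng.normal(size=(n, DIM))
    style = source @ W + 0.1 * rng.normal(size=(n, DIM))
    return style, ling

def fit_dependency(style, ling, reg=1e-2):
    """Stage 1 stand-in: ridge regression from linguistics to style."""
    gram = ling.T @ ling + reg * np.eye(ling.shape[1])
    return np.linalg.solve(gram, ling.T @ style)

def mismatch_score(style, ling, A):
    """Cosine distance between predicted and observed style (near 0 = matched)."""
    pred = ling @ A
    denom = np.linalg.norm(pred, axis=1) * np.linalg.norm(style, axis=1) + 1e-8
    return 1.0 - np.sum(pred * style, axis=1) / denom

def build_features(style, ling, A):
    """Stage 2 input: mismatch feature plus the original representations."""
    return np.column_stack([mismatch_score(style, ling, A), style, ling])

def train_head(X, y, lr=0.1, steps=2000):
    """Lightweight head stand-in: logistic regression via gradient descent."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        w -= lr * X.T @ (p - y) / len(y)
        b -= lr * np.mean(p - y)
    return w, b

# Stage 1: fit the style-linguistics dependency on real samples only.
A = fit_dependency(*make_samples(200, matched=True))

# Stage 2: supervised training on real (label 0) and fake (label 1) samples.
def labeled_set(n):
    real = build_features(*make_samples(n, matched=True), A)
    fake = build_features(*make_samples(n, matched=False), A)
    return np.vstack([real, fake]), np.concatenate([np.zeros(n), np.ones(n)])

X_train, y_train = labeled_set(200)
X_test, y_test = labeled_set(200)
mu, sigma = X_train.mean(axis=0), X_train.std(axis=0) + 1e-8
w, b = train_head((X_train - mu) / sigma, y_train)

pred = ((X_test - mu) / sigma) @ w + b > 0
acc = np.mean(pred == y_test)
real_scores, fake_scores = X_test[:200, 0], X_test[200:, 0]
print(f"held-out accuracy: {acc:.3f}")
print(f"mean mismatch  real: {real_scores.mean():.3f}  fake: {fake_scores.mean():.3f}")
```

In this toy setup the style and linguistics vectors have the same marginal distribution in both classes, so the head can only separate them through the mismatch feature. That feature is also a per-sample quantity that can be inspected directly, which is the kind of explanation the summary attributes to SLIM.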