Comparative Analysis of ASR Methods for Speech Deepfake Detection

Davide Salvi, Amit Kumar Singh Yadav, Kratika Bhagtani, Viola Negroni, Paolo Bestagini, Edward J. Delp
2024-11-26
Abstract: Recent techniques for speech deepfake detection often rely on pre-trained self-supervised models. These systems, initially developed for Automatic Speech Recognition (ASR), have proved their ability to offer a meaningful representation of speech signals, which can benefit various tasks, including deepfake detection. In this context, pre-trained models serve as feature extractors and are used to extract embeddings from input speech, which are then fed to a binary speech deepfake detector. The remarkable accuracy achieved through this approach underscores a potential relationship between ASR and speech deepfake detection. However, this connection is not yet entirely clear, and we do not know whether improved performance in ASR corresponds to higher speech deepfake detection capabilities. In this paper, we address this question through a systematic analysis. We consider two different pre-trained self-supervised ASR models, Whisper and Wav2Vec 2.0, and adapt them for the speech deepfake detection task. These models have been released in multiple versions, with increasing number of parameters and enhanced ASR performance. We investigate whether performance improvements in ASR correlate with improvements in speech deepfake detection. Our results provide insights into the relationship between these two tasks and offer valuable guidance for the development of more effective speech deepfake detectors.
Sound
What problem does this paper attempt to address?
This paper asks whether improvements in automatic speech recognition (ASR) performance translate into better speech deepfake detection. Specifically, when a pre-trained self-supervised ASR model is used as a feature extractor for speech deepfake detection, is better ASR performance associated with better detection performance?

### Background and Motivation

Recent advances in speech deepfake technology have made it increasingly easy to generate highly realistic synthetic speech, which poses risks to society. Effective speech deepfake detection tools are therefore crucial. Existing research shows that pre-trained self-supervised ASR models perform well as a basis for speech deepfake detection, but it is not yet clear whether a model's gains on ASR tasks carry over to deepfake detection.

### Research Objectives

To explore this question, the authors select two well-known pre-trained ASR models, Whisper and Wav2Vec 2.0, and examine the relationship between ASR performance and deepfake detection performance through experiments on multiple versions of each model. The versions differ in parameter count and training strategy, which lets the authors systematically evaluate whether better ASR performance goes together with better deepfake detection.

### Experimental Design

1. **Dataset**: The study uses the Logical Access (LA) partition of the ASVspoof 2019 dataset, which contains real speech and synthetic speech generated by different algorithms.
2. **Model Architecture**:
   - **Embedding Extractor**: A pre-trained ASR model (Whisper or Wav2Vec 2.0) whose parameters are frozen.
   - **Classifier**: A fully connected (FC) network that maps the extracted embeddings to a deepfake detection decision.
3. **Experimental Procedure**:
   - The training and validation sets are used for model training and tuning.
   - The test set is used to evaluate detection performance in an open-set scenario.

### Conclusions

Through systematic experiments and analysis, the authors aim to clarify the relationship between ASR performance and deepfake detection performance, and to provide guidance for developing more effective speech deepfake detection tools.

### Formula Representation

The formulas involved in the paper mainly concern the model structure and loss function. For example, a contrastive loss of the kind used to pre-train Wav2Vec 2.0 can be written in the following generic form:

\[ \mathcal{L}_{\text{contrastive}} = -\log \frac{\exp(\text{sim}(z_i, z_j) / \tau)}{\sum_{k = 1}^{N} \mathbf{1}[k \neq i] \exp(\text{sim}(z_i, z_k) / \tau)} \]

where \( \text{sim}(z_i, z_j) \) is the similarity between the feature vectors \( z_i \) and \( z_j \), \( \tau \) is the temperature parameter, \( N \) is the batch size, and the indicator \( \mathbf{1}[k \neq i] \) excludes the anchor \( z_i \) itself from the denominator. Note that in Wav2Vec 2.0 itself the comparison is between a context vector and quantized targets, with the negatives drawn from other masked time steps of the same utterance, so the expression above should be read as the general shape of the loss rather than its exact definition.

The analysis is also intended as a reference for subsequent research.
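To make the contrastive loss under Formula Representation concrete, here is a minimal numerical sketch in plain Python. The source only names a generic similarity, so several choices here are assumptions for illustration: cosine similarity, the temperature `tau=0.1`, the toy two-dimensional embeddings, and the function names. It is not the Wav2Vec 2.0 training code.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity; an assumed choice for sim(., .) in the formula."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def contrastive_loss(z, i, j, tau=0.1):
    """Loss for anchor z[i] with positive z[j].

    Every index k != i contributes to the denominator, which mirrors the
    indicator 1[k != i] in the formula.
    """
    numerator = math.exp(cosine_similarity(z[i], z[j]) / tau)
    denominator = sum(
        math.exp(cosine_similarity(z[i], z[k]) / tau)
        for k in range(len(z))
        if k != i
    )
    return -math.log(numerator / denominator)


# Toy embeddings (hypothetical): z[1] is close to z[0], z[2] is orthogonal to
# it, and z[3] points in the opposite direction.
embeddings = [
    [1.0, 0.0],
    [0.9, 0.1],
    [0.0, 1.0],
    [-1.0, 0.0],
]

loss_close = contrastive_loss(embeddings, i=0, j=1)  # positive is similar
loss_far = contrastive_loss(embeddings, i=0, j=2)    # positive is dissimilar
print(f"similar positive:    {loss_close:.6f}")
print(f"dissimilar positive: {loss_far:.6f}")
```

The loss is small when the designated positive is the candidate most similar to the anchor and grows when other candidates are more similar. A lower temperature sharpens this contrast.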
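The frozen-extractor-plus-classifier design described under Experimental Design can be illustrated with a toy sketch that needs only the Python standard library. Everything in it is an assumption for illustration: the "encoder" is a fixed random projection standing in for Whisper or Wav2Vec 2.0, the data are synthetic, and the classifier is reduced to a single fully connected layer with a sigmoid. It only shows the training pattern, in which the extractor's weights never change and only the classifier is updated.

```python
import math
import random

random.seed(0)

EMB_DIM = 4    # hypothetical embedding size; real ASR encoders are far larger
FRAME_DIM = 3  # hypothetical size of one input "frame"

# Stand-in for a frozen pre-trained encoder: a fixed random projection.
# This is NOT Whisper or Wav2Vec 2.0, only a placeholder with frozen weights.
FROZEN_WEIGHTS = [
    [random.uniform(-1, 1) for _ in range(FRAME_DIM)] for _ in range(EMB_DIM)
]


def extract_embedding(frames):
    """Project each frame with the frozen weights, then mean-pool over time."""
    pooled = [0.0] * EMB_DIM
    for frame in frames:
        for d in range(EMB_DIM):
            pooled[d] += math.tanh(
                sum(w * x for w, x in zip(FROZEN_WEIGHTS[d], frame))
            )
    return [p / len(frames) for p in pooled]


def predict(embedding, weights, bias):
    """One fully connected layer plus sigmoid: probability of 'fake'."""
    logit = sum(w * e for w, e in zip(weights, embedding)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


def make_utterance(label, n_frames=20):
    """Synthetic data: 'fake' frames (label 1) are shifted against 'real' ones."""
    shift = 1.0 if label == 1 else -1.0
    return [
        [random.gauss(shift, 0.5) for _ in range(FRAME_DIM)]
        for _ in range(n_frames)
    ]


frozen_before = [row[:] for row in FROZEN_WEIGHTS]

# Because the extractor is frozen, embeddings can be computed once and cached.
train = [(extract_embedding(make_utterance(lbl)), lbl) for lbl in [0, 1] * 20]

weights, bias, lr = [0.0] * EMB_DIM, 0.0, 0.5
for _ in range(50):
    for emb, label in train:
        # Gradient of binary cross-entropy with respect to the logit.
        error = predict(emb, weights, bias) - label
        weights = [w - lr * error * e for w, e in zip(weights, emb)]
        bias -= lr * error

test_set = [(extract_embedding(make_utterance(lbl)), lbl) for lbl in [0, 1] * 10]
accuracy = sum(
    (predict(emb, weights, bias) > 0.5) == bool(lbl) for emb, lbl in test_set
) / len(test_set)
print(f"toy test accuracy: {accuracy:.2f}")
```

In a real setup the stand-in projection would be replaced by the encoder of a pre-trained model with gradient updates disabled. Comparing several encoder sizes would then amount to swapping the extractor while keeping the classifier and training recipe fixed.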