Abstract: Recent advancements in text-to-speech and speech conversion technologies have enabled the creation of highly convincing synthetic speech. While these innovations offer numerous practical benefits, they also pose significant security challenges when maliciously misused. Therefore, there is an urgent need to detect these synthetic speech signals. Phoneme features provide a powerful speech representation for deepfake detection. However, previous phoneme-based detection approaches typically focused on specific phonemes, overlooking temporal inconsistencies across the entire phoneme sequence. In this paper, we develop a new mechanism for detecting speech deepfakes by identifying inconsistencies in phoneme-level speech features. We design an adaptive phoneme pooling technique that extracts sample-specific phoneme-level features from frame-level speech data. By applying this technique to features extracted by pre-trained audio models on previously unseen deepfake datasets, we demonstrate that deepfake samples often exhibit phoneme-level inconsistencies when compared to genuine speech. To further enhance detection accuracy, we propose a deepfake detector that uses a graph attention network to model the temporal dependencies of phoneme-level features. Additionally, we introduce a random phoneme substitution augmentation technique to increase feature diversity during training. Extensive experiments on four benchmark datasets demonstrate the superior performance of our method over existing state-of-the-art detection methods.
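
To make the idea of turning frame-level features into phoneme-level features concrete, here is a minimal sketch in plain Python. It is not the paper's adaptive phoneme pooling procedure, whose details are not given here. It assumes that each frame already carries a predicted phoneme label (for example, from a phoneme recognizer) and simply averages consecutive frames that share a label. The function name `phoneme_pool` and the toy inputs are hypothetical.

```python
from itertools import groupby


def phoneme_pool(frame_features, frame_phonemes):
    """Average each run of consecutive frames that share a phoneme label.

    Assumption: one predicted phoneme label per frame. Returns a list of
    (label, mean_feature_vector) pairs, one per phoneme segment.
    """
    if len(frame_features) != len(frame_phonemes):
        raise ValueError("need exactly one phoneme label per frame")

    pooled = []
    start = 0
    for label, run in groupby(frame_phonemes):
        length = len(list(run))  # number of consecutive frames with this label
        segment = frame_features[start:start + length]
        dim = len(segment[0])
        mean = [sum(frame[d] for frame in segment) / length for d in range(dim)]
        pooled.append((label, mean))
        start += length
    return pooled


# Toy input: five 2-dimensional frames covering two phoneme segments.
frames = [[1.0, 0.0], [3.0, 2.0], [0.0, 4.0], [2.0, 2.0], [4.0, 6.0]]
labels = ["h", "h", "ay", "ay", "ay"]
segments = phoneme_pool(frames, labels)
```

Because segment boundaries come from the labels of the individual sample, the number and length of pooled segments differ from one utterance to the next, which is one way to read "sample-specific" in the abstract.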

What problem does this paper attempt to address?

The main problem this paper addresses is **detecting highly realistic speech deepfakes**. As text-to-speech (TTS) and voice conversion technologies advance, the quality of synthetic speech keeps rising. This brings many benefits in practical applications, but it also poses significant security challenges when these technologies are maliciously misused. There is therefore an urgent need for an effective method to distinguish real speech from synthetic deepfake speech.

### Specific Problems and Solutions

1. **Limitations of Existing Methods**:
   - Existing phoneme-based detection methods usually focus only on specific phonemes and ignore temporal inconsistencies across the entire phoneme sequence.
   - These methods often need to extract a specific phoneme set for each dataset, which is time-consuming and generalizes poorly.
2. **Proposed New Method**:
   - **Phoneme-level Feature Differences**: The authors found that deepfake speech exhibits inconsistencies in phoneme-level features, and that these inconsistencies can serve as reliable indicators for detection.
   - **Adaptive Phoneme Pooling Technique**: Frame-level speech features are converted into sample-specific phoneme-level features, capturing the unique configuration of each phoneme and the transitions between phonemes.
   - **Graph Attention Network (GAT)**: A GAT models the temporal dependencies of phoneme-level features to further improve detection accuracy.
   - **Random Phoneme Substitution Augmentation (RPSA)**: Phonemes are randomly replaced during training to increase feature diversity and improve the robustness of the model.

### Main Contributions

- **Identifying Inconsistencies in Phoneme-level Features**: Phoneme-level features generated through adaptive phoneme pooling reveal the differences between real samples and deepfake samples.
- **Constructing a Phoneme-based Deepfake Speech Detection Model**: The model combines a pre-trained phoneme recognition system with a GAT module and introduces a data augmentation method.
- **Comprehensive Evaluation**: Extensive experiments on multiple benchmark datasets show performance superior to existing state-of-the-art methods and verify the effectiveness of each component.

### Experimental Results

- In the cross-method evaluation on the ASVspoof2021 DF dataset, the method significantly outperforms other methods in almost all categories.
- In the cross-language and cross-dataset evaluations, the method shows excellent generalization ability, especially on unseen languages and datasets.
- In the robustness evaluation, the method maintains high performance under background noise and compression artifacts.

Through these improvements, the paper provides a more effective deepfake speech detection method that can address the security challenges brought by increasingly sophisticated speech synthesis technologies.
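
The random phoneme substitution augmentation mentioned above can be illustrated with a small sketch. This is only one plausible reading, not the authors' implementation: the summary does not say where replacement features come from or how often substitution happens. The sketch assumes phoneme-level features are stored as `(label, feature)` pairs, that replacements are drawn from a hypothetical `donor_bank` of features taken from other training samples with the same phoneme label, and that each segment is replaced with probability `prob`. All names here are illustrative.

```python
import random


def random_phoneme_substitution(segments, donor_bank, prob=0.2, rng=None):
    """Randomly swap phoneme-level features with same-label donor features.

    Assumptions (not from the paper): `segments` is a list of
    (label, feature) pairs, `donor_bank` maps a phoneme label to a list of
    candidate feature vectors, and each segment is replaced independently
    with probability `prob`.
    """
    if rng is None:
        rng = random.Random()

    augmented = []
    for label, feature in segments:
        donors = donor_bank.get(label)
        # Replace only when a donor exists for this label and the draw succeeds.
        if donors and rng.random() < prob:
            feature = rng.choice(donors)
        augmented.append((label, list(feature)))  # copy so inputs stay untouched
    return augmented


# Toy usage: with prob=1.0 every segment that has donors is replaced.
segments = [("h", [2.0, 1.0]), ("ay", [2.0, 4.0])]
donor_bank = {"h": [[9.0, 9.0]], "ay": [[7.0, 7.0], [8.0, 8.0]]}
augmented = random_phoneme_substitution(
    segments, donor_bank, prob=1.0, rng=random.Random(0)
)
```

The phoneme sequence itself is left unchanged in this sketch; only the feature attached to each phoneme varies, so a downstream sequence model would see the same labels with more diverse features during training.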

Phoneme-Level Feature Discrepancies: A Key to Detecting Sophisticated Speech Deepfakes

Ghost-in-Wave: How Speaker-Irrelative Features Interfere DeepFake Voice Detectors

Transferring Audio Deepfake Detection Capability Across Languages

Robust AI-Synthesized Speech Detection Using Feature Decomposition Learning and Synthesizer Feature Augmentation

Self-Attention and Hybrid Features for Replay and Deep-Fake Audio Detection

Audio Deepfake Detection with Self-Supervised WavLM and Multi-Fusion Attentive Classifier

Audio Deepfake Detection Based on a Combination of F0 Information and Real Plus Imaginary Spectrogram Features

Speaker Recognition-Assisted Robust Audio Deepfake Detection

Voice-Face Homogeneity Tells Deepfake

I Can Hear You: Selective Robust Training for Deepfake Audio Detection

A Comparative Study on Physical and Perceptual Features for Deepfake Audio Detection

Does Audio Deepfake Detection Generalize?

An explainable deepfake of speech detection method with spectrograms and waveforms

A lightweight feature extraction technique for deepfake audio detection

Can DeepFake Speech be Reliably Detected?

SafeEar: Content Privacy-Preserving Audio Deepfake Detection

Efficient Deepfake Audio Detection Using Spectro-Temporal Analysis and Deep Learning

Deepfake Audio Detection Using Spectrogram-based Feature and Ensemble of Deep Learning Models

Discriminative Feature Decoupling Enhancement for Speech Forgery Detection

Acoustic features analysis for explainable machine learning-based audio spoofing detection

DeepSonar: Towards Effective and Robust Detection of AI-Synthesized Fake Voices