Phoneme-Level Feature Discrepancies: A Key to Detecting Sophisticated Speech Deepfakes

Kuiyuan Zhang, Zhongyun Hua, Rushi Lan, Yushu Zhang, Yifang Guo
2024-12-17
Abstract: Recent advancements in text-to-speech and speech conversion technologies have enabled the creation of highly convincing synthetic speech. While these innovations offer numerous practical benefits, they also cause significant security challenges when maliciously misused. Therefore, there is an urgent need to detect these synthetic speech signals. Phoneme features provide a powerful speech representation for deepfake detection. However, previous phoneme-based detection approaches typically focused on specific phonemes, overlooking temporal inconsistencies across the entire phoneme sequence. In this paper, we develop a new mechanism for detecting speech deepfakes by identifying the inconsistencies of phoneme-level speech features. We design an adaptive phoneme pooling technique that extracts sample-specific phoneme-level features from frame-level speech data. By applying this technique to features extracted by pre-trained audio models on previously unseen deepfake datasets, we demonstrate that deepfake samples often exhibit phoneme-level inconsistencies when compared to genuine speech. To further enhance detection accuracy, we propose a deepfake detector that uses a graph attention network to model the temporal dependencies of phoneme-level features. Additionally, we introduce a random phoneme substitution augmentation technique to increase feature diversity during training. Extensive experiments on four benchmark datasets demonstrate the superior performance of our method over existing state-of-the-art detection methods.
Sound, Artificial Intelligence, Audio and Speech Processing
What problem does this paper attempt to address?
The main problem this paper addresses is **detecting highly realistic speech deepfakes**. As text-to-speech (TTS) and voice conversion technologies advance, the quality of synthetic speech keeps improving. This brings practical benefits but also poses significant security challenges when the technologies are maliciously misused. An effective method for distinguishing genuine speech from synthetic deepfake speech is therefore urgently needed.

### Specific Problems and Solutions

1. **Limitations of Existing Methods**:
   - Existing phoneme-based detection methods usually focus on specific phonemes and ignore temporal inconsistencies across the entire phoneme sequence.
   - These methods often require a specific phoneme set to be extracted for each dataset, which is time-consuming and generalizes poorly.
2. **Proposed New Method**:
   - **Phoneme-level Feature Discrepancies**: The authors found that deepfake speech exhibits inconsistencies in its phoneme-level features, and that these inconsistencies can serve as reliable indicators for detection.
   - **Adaptive Phoneme Pooling Technique**: Frame-level speech features are converted into sample-specific phoneme-level features, which capture the characteristics of each phoneme and the transitions between phonemes.
   - **Graph Attention Network (GAT)**: A GAT models the temporal dependencies of the phoneme-level features to further improve detection accuracy.
   - **Random Phoneme Substitution Augmentation (RPSA)**: Phonemes are randomly replaced during training to increase feature diversity and improve the robustness of the model.

### Main Contributions

- **Identifying Inconsistencies in Phoneme-level Features**: Phoneme-level features generated through adaptive phoneme pooling reveal the differences between genuine samples and deepfake samples.
- **Constructing a Phoneme-based Speech Deepfake Detection Model**: The model combines a pre-trained phoneme recognition system with a GAT module and introduces a data augmentation method.
- **Comprehensive Evaluation**: Extensive experiments on multiple benchmark datasets show performance superior to existing state-of-the-art methods and verify the effectiveness of each component.

### Experimental Results

- In the cross-method evaluation on the ASVspoof2021 DF dataset, the method significantly outperforms other methods in almost all categories.
- In the cross-language and cross-dataset evaluations, the method shows excellent generalization ability, especially on unseen languages and datasets.
- In the robustness evaluation, the method maintains high performance under background noise and compression artifacts.

Through these improvements, the paper provides a more effective speech deepfake detection method for the security challenges posed by increasingly sophisticated speech synthesis technologies.
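
To make the adaptive phoneme pooling idea described above more concrete, here is a minimal sketch in plain Python. It is an illustration under assumptions, not the authors' implementation: it assumes each frame already carries a phoneme label (for example, from a pre-trained phoneme recognizer), it merges consecutive frames with the same label by simple averaging, and the function name `phoneme_pool` and the toy data are hypothetical. The pooling used in the paper may differ in its details.

```python
from itertools import groupby
from typing import List, Sequence, Tuple


def phoneme_pool(
    frame_features: Sequence[Sequence[float]],
    frame_phonemes: Sequence[str],
) -> Tuple[List[str], List[List[float]]]:
    """Average each run of consecutive frames that share a phoneme label.

    Assumption: mean pooling over each run; this is an illustrative choice.
    """
    if len(frame_features) != len(frame_phonemes):
        raise ValueError("need exactly one phoneme label per frame")

    labels: List[str] = []
    pooled: List[List[float]] = []
    start = 0
    for label, run in groupby(frame_phonemes):
        length = sum(1 for _ in run)  # number of frames in this run
        segment = frame_features[start:start + length]
        # Mean over the time axis of the segment, one value per feature dim.
        pooled.append([sum(column) / length for column in zip(*segment)])
        labels.append(label)
        start += length
    return labels, pooled


# Toy input: 5 frames of 2-dimensional features with hypothetical labels.
features = [[1.0, 0.0], [3.0, 2.0], [4.0, 4.0], [0.0, 2.0], [2.0, 4.0]]
phonemes = ["AH", "AH", "T", "AH", "AH"]
labels, pooled = phoneme_pool(features, phonemes)
print(labels)  # ['AH', 'T', 'AH']
print(pooled)  # [[2.0, 1.0], [4.0, 4.0], [1.0, 3.0]]
```

Because the segment boundaries come from each sample's own label sequence, the number of pooled vectors varies per utterance, which is one way to read "sample-specific" phoneme-level features. The two separate `AH` runs stay separate, so the order of phonemes and the transitions between them are preserved for a downstream sequence or graph model.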