Audio Multi-view Spoofing Detection Framework Based on Audio-Text-Emotion Correlations
Junyan Wu, Qilin Yin, Ziqi Sheng, Wei Lu, Jiwu Huang, Bin Li
DOI: https://doi.org/10.1109/tifs.2024.3431888
IF: 7.231
2024-01-01
IEEE Transactions on Information Forensics and Security
Abstract: In recent years, audio spoofing detection has received widespread attention for protecting personal privacy and social security. Despite significant progress in audio single-view spoofing detection, challenges remain in handling unknown spoofing attacks in realistic scenarios. To address these challenges, we introduce a novel audio multi-view spoofing detection framework (AMSDF), whose goal is to capture both intra-view and inter-view cues by measuring correlations within audio multi-view features (i.e., audio-emotion-text) for audio spoofing detection. In general, features from different views are inherently interconnected in real audio, whereas they may exhibit unnatural correlations in spoofed audio. Therefore, more discriminative cues can be mined from their complex interactions, which benefits the audio spoofing detection task. To this end, an intra-view graph attention mechanism (IGAM) is first used to aggregate the nodes within each view. Subsequently, a heterogeneous graph fusion module (HGFM) is applied to measure correlations among inter-view nodes, which are enhanced with a master node for comprehensive analysis. Finally, a group-based readout scheme (GRS) is designed to capture and preserve the most distinctive cues by leveraging the strengths of different feature sets, thereby effectively distinguishing subtle differences between real and spoofed audio. The experimental results show that the proposed framework outperforms state-of-the-art methods, especially in realistic scenarios. The code and pre-trained models are available at https://github.com/ItzJuny/AMSDF.
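The three stages named in the abstract (intra-view aggregation, inter-view fusion with a master node, and group-based readout) can be illustrated with a toy sketch. This is not the authors' implementation: it assumes parameter-free dot-product attention in place of learned graph attention, a mean-pooled master node, and a max-plus-mean readout over contiguous node groups. The function names (`attend`, `fuse_views`, `group_readout`) and the random node features are hypothetical, chosen only for illustration.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attend(nodes):
    """Assumed stand-in for graph attention: each node becomes a
    softmax-weighted average of all nodes in a fully connected graph."""
    dim = len(nodes[0])
    out = []
    for q in nodes:
        weights = softmax([dot(q, k) / math.sqrt(dim) for k in nodes])
        out.append([sum(w * k[d] for w, k in zip(weights, nodes))
                    for d in range(dim)])
    return out


def fuse_views(views):
    """Assumed inter-view step: pool the nodes of every view into one graph,
    add a master node (here simply the mean of all nodes), then attend."""
    all_nodes = [n for view in views for n in view]
    dim = len(all_nodes[0])
    master = [sum(n[d] for n in all_nodes) / len(all_nodes)
              for d in range(dim)]
    fused = attend(all_nodes + [master])
    return fused[:-1], fused[-1]


def group_readout(nodes, master, groups):
    """Assumed readout: split nodes into contiguous groups, keep the
    element-wise max and mean of each group, and append the master node."""
    dim = len(nodes[0])
    size = len(nodes) // groups
    embedding = []
    for g in range(groups):
        chunk = nodes[g * size:(g + 1) * size]
        embedding += [max(n[d] for n in chunk) for d in range(dim)]
        embedding += [sum(n[d] for n in chunk) / len(chunk)
                      for d in range(dim)]
    return embedding + master


# Hypothetical input: 3 views, each with 3 nodes of dimension 4.
random.seed(0)
dim = 4
views = {name: [[random.gauss(0, 1) for _ in range(dim)] for _ in range(3)]
         for name in ("audio", "emotion", "text")}

intra = [attend(v) for v in views.values()]   # intra-view aggregation
nodes, master = fuse_views(intra)             # inter-view fusion
embedding = group_readout(nodes, master, groups=3)
print(len(embedding))  # 3 groups * (4 max + 4 mean) + 4 master = 28
```

In a trained detector, the resulting embedding would feed a binary real/spoof classifier; the sketch stops at the embedding because the abstract does not specify the classifier head.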