Speech Formants Integration for Generalized Detection of Synthetic Speech Spoofing Attacks

Kexu Liu, Yuanxin Wang, Shengchen Li, Xi Shao
DOI: https://doi.org/10.21437/interspeech.2024-292
2024-01-01
Abstract: Existing synthetic speech detection systems show high performance variance across different spoofing attacks. This is due to the diversity of unseen synthesis algorithms, which makes it challenging for a system to generalize to unseen spoofing attacks. To address this, we propose multi-view features with one-class learning for synthetic speech detection. The key idea is to capture bona-fide speech features from the dynamic information of formants and from XLS-R dimensions, so that bona-fide speech is represented compactly in the embedding space without the need to fit various unseen spoofing attacks. To leverage the multi-view features, the dynamic information of formants is integrated with XLS-R features using a parallel attention mechanism and gating modulation. Our system achieves an equal error rate (EER) of 0.39% in the ASVspoof 2019 logical access scenario, with a low performance variance of 0.069 across all 13 attacks, outperforming most mainstream single systems.
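The abstract reports its headline result as an equal error rate (EER), the operating point at which the false acceptance rate on spoofed speech equals the false rejection rate on bona-fide speech. The following is a minimal sketch of how an EER can be estimated from detector scores. It is not the authors' evaluation code: the function name `equal_error_rate`, the convention that a higher score means "more likely bona-fide", and the toy scores are all assumptions made for illustration.

```python
def equal_error_rate(bonafide_scores, spoof_scores):
    """Estimate the EER by sweeping a threshold over all observed scores.

    Assumption: a higher score means the detector considers the
    utterance more likely to be bona-fide.
    """
    thresholds = sorted(set(bonafide_scores) | set(spoof_scores))
    best_gap = float("inf")
    eer = None
    for t in thresholds:
        # False rejection: bona-fide utterances scored below the threshold.
        frr = sum(s < t for s in bonafide_scores) / len(bonafide_scores)
        # False acceptance: spoofed utterances scored at or above the threshold.
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            # Keep the threshold where the two error rates are closest
            # and report their average as the EER estimate.
            best_gap = gap
            eer = (far + frr) / 2
    return eer


# Toy scores (hypothetical, not from the paper).
bonafide = [0.9, 0.8, 0.7, 0.4]
spoof = [0.1, 0.2, 0.3, 0.6]
print(equal_error_rate(bonafide, spoof))
```

With a finite set of scores, the two error rates may never coincide exactly, so this sketch averages them at the closest threshold; official evaluation toolkits may interpolate differently.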