Robust Audio Anti-Spoofing with Fusion-Reconstruction Learning on Multi-Order Spectrograms

Penghui Wen, Kun Hu, Wenxi Yue, Sen Zhang, Wanlei Zhou, Zhiyong Wang
2023-08-18
Abstract: Robust audio anti-spoofing has become increasingly challenging due to recent advances in deepfake techniques. While spectrograms have demonstrated their capability for anti-spoofing, the complementary information present in multi-order spectral patterns has not been well explored, which limits their effectiveness against varying spoofing attacks. Therefore, we propose a novel deep learning method with a spectral fusion-reconstruction strategy, namely S2pecNet, to utilise multi-order spectral patterns for robust audio anti-spoofing representations. Specifically, spectral patterns up to second order are fused in a coarse-to-fine manner, and two branches are designed for the fine-level fusion from the spectral and temporal contexts. A reconstruction from the fused representation to the input spectrograms further reduces the potential loss of fused information. Our method achieves state-of-the-art performance with an EER of 0.77% on a widely used dataset: ASVspoof2019 LA Challenge.
Sound, Artificial Intelligence, Multimedia, Audio and Speech Processing
What problem does this paper attempt to address?
This paper addresses the robustness of audio anti-spoofing, which has become increasingly challenging as deepfake technology develops. Existing methods typically rely on a specific type of audio feature, and each feature performs differently across attack types, which limits effectiveness when many kinds of spoofing attack must be handled.

To improve robustness, the authors propose a new deep learning method called S2pecNet. It extracts richer audio features by fusing multi-order spectrograms (first-order raw spectrograms and second-order power spectrograms), thereby improving detection across different types of spoofing attacks. Specifically, S2pecNet adopts a coarse-to-fine fusion mechanism and designs two branches for fine-grained fusion from the spectral and temporal contexts. In addition, a reconstruction mechanism maps the fused representation back to the input spectrograms to reduce information loss.

The main contributions of the paper include:
1. Proposing a novel deep learning fusion architecture based on multi-order spectrograms for robust audio anti-spoofing.
2. Designing a coarse-to-fine fusion mechanism that performs fine-grained fusion from spectral and temporal contexts through two branches.
3. Introducing a reconstruction strategy to preserve information in the fused speech representation.

Experimental results show that S2pecNet achieves state-of-the-art performance on the ASVspoof2019 LA dataset as measured by the minimum tandem detection cost function (min t-DCF) and the equal error rate (EER), with a reported EER of 0.77%.
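To make the multi-order input concrete, the following is a minimal NumPy sketch of how a first-order and a second-order spectrogram could be computed from one waveform and stacked for a coarse fusion step. It is an illustration under stated assumptions, not the authors' implementation: treating the "raw" spectrogram as the STFT magnitude, the frame parameters `n_fft` and `hop`, the log compression, and the helper names `stft_magnitude`, `multi_order_spectrograms`, and `coarse_stack` are all hypothetical. In S2pecNet the fusion and the reconstruction are learned by the network; simple channel stacking stands in for them here.

```python
import numpy as np


def stft_magnitude(signal, n_fft=512, hop=160):
    """Magnitude of a Hann-windowed STFT, shaped (freq_bins, frames).

    Assumption: this plays the role of the first-order "raw" spectrogram.
    """
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack(
        [signal[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    # rfft keeps the n_fft // 2 + 1 non-negative frequency bins.
    return np.abs(np.fft.rfft(frames, axis=1)).T


def multi_order_spectrograms(signal, n_fft=512, hop=160):
    """Return (first_order, second_order) spectrograms of the same shape."""
    first = stft_magnitude(signal, n_fft, hop)
    second = first ** 2  # second-order: power spectrogram
    return first, second


def coarse_stack(first, second):
    """Placeholder for coarse fusion: log-compress and stack as channels.

    Assumption: the paper's fusion is learned; this only aligns the two
    inputs into one (2, freq_bins, frames) array a network could consume.
    """
    return np.stack([np.log1p(first), np.log1p(second)], axis=0)


if __name__ == "__main__":
    sr = 16000
    t = np.arange(sr) / sr
    tone = np.sin(2 * np.pi * 1000.0 * t)  # one second of a 1 kHz tone
    first, second = multi_order_spectrograms(tone)
    fused = coarse_stack(first, second)
    print(first.shape, second.shape, fused.shape)
```

Both spectrograms share the same time-frequency grid, which is what allows them to be fused element-wise or channel-wise; a reconstruction objective would then compare a decoder's output against these same two arrays.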