Abstract: Robust audio anti-spoofing has become increasingly challenging due to recent advances in deepfake techniques. While spectrograms have demonstrated their capability for anti-spoofing, the complementary information present in multi-order spectral patterns has not been well explored, which limits their effectiveness against varying spoofing attacks. Therefore, we propose a novel deep learning method with a spectral fusion-reconstruction strategy, namely S2pecNet, to utilise multi-order spectral patterns for robust audio anti-spoofing representations. Specifically, spectral patterns up to second order are fused in a coarse-to-fine manner, and two branches are designed for the fine-level fusion from the spectral and temporal contexts. A reconstruction from the fused representation to the input spectrograms further reduces the potential loss of fused information. Our method achieves state-of-the-art performance with an EER of 0.77% on the widely used ASVspoof2019 LA Challenge dataset.
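
For readers unfamiliar with the EER figure quoted in the abstract, the following is a minimal sketch of how an equal error rate can be approximated from detector scores. It is not taken from the paper or from the official ASVspoof scoring tools. The function name `equal_error_rate`, the score convention (higher means more likely bona fide), and the toy scores are all assumptions made for illustration.

```python
def equal_error_rate(bonafide_scores, spoof_scores):
    """Approximate the EER by sweeping thresholds over all observed scores.

    Assumed convention: a higher score means "more likely bona fide".
    FAR = fraction of spoof trials accepted (score >= threshold).
    FRR = fraction of bona fide trials rejected (score < threshold).
    The EER is taken at the threshold where FAR and FRR are closest.
    """
    best_gap, eer = float("inf"), None
    for t in sorted(set(bonafide_scores) | set(spoof_scores)):
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        frr = sum(s < t for s in bonafide_scores) / len(bonafide_scores)
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer


# Hypothetical toy scores: one bona fide trial scores below one spoof trial.
toy_bonafide = [0.9, 0.8, 0.7, 0.4]
toy_spoof = [0.6, 0.3, 0.2, 0.1]
toy_eer = equal_error_rate(toy_bonafide, toy_spoof)
```

On this toy data the two error rates meet at 25%. A reported EER of 0.77% means the false acceptance and false rejection rates are both roughly 0.77% at the operating point where they coincide.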

What problem does this paper attempt to address?

This paper addresses the robustness of audio anti-spoofing, which has become increasingly challenging as deepfake technology advances. Existing methods typically rely on specific types of audio features, and these features perform differently across attack types, which limits their effectiveness against diverse spoofing attacks. To improve robustness, the authors propose a new deep learning method called S2pecNet. The method fuses multi-order spectrograms (first-order raw spectrograms and second-order power spectrograms) to extract richer audio features, thereby improving detection across different types of spoofing attacks. Specifically, S2pecNet adopts a coarse-to-fine fusion mechanism with two branches that perform fine-grained fusion from the spectral and temporal contexts. In addition, a reconstruction mechanism maps the fused representation back to the input spectrograms to reduce information loss.

The main contributions of the paper are:

1. A novel deep learning fusion architecture based on multi-order spectrograms for robust audio anti-spoofing.

2. A coarse-to-fine fusion mechanism that performs fine-grained fusion from the spectral and temporal contexts through two branches.

3. A reconstruction strategy that preserves information in the fused speech representation.

Experimental results show that S2pecNet achieves state-of-the-art performance on the ASVspoof2019 LA dataset in terms of both the minimum tandem detection cost function (min t-DCF) and the equal error rate (EER), with the abstract reporting an EER of 0.77%.
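
To make the idea of multi-order spectrograms concrete, here is a minimal sketch that computes a first-order (magnitude) and a second-order (power) spectrogram from the same framed signal. It is an illustration only: the summary does not specify the paper's front end, so the frame length, hop size, Hann window, and the reading of "first-order" as the STFT magnitude are assumptions, and the helper names are hypothetical. A naive DFT is used to keep the example dependency-free; a practical implementation would use an FFT library.

```python
import cmath
import math


def frame_signal(signal, frame_len, hop):
    """Split a 1-D signal into overlapping frames (no padding)."""
    return [signal[s:s + frame_len]
            for s in range(0, len(signal) - frame_len + 1, hop)]


def hann(n):
    """Periodic Hann window of length n."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / n) for i in range(n)]


def dft_magnitudes(frame):
    """Magnitudes of the non-negative-frequency bins (naive O(n^2) DFT)."""
    n = len(frame)
    mags = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                  for i, x in enumerate(frame))
        mags.append(abs(acc))
    return mags


def multi_order_spectrograms(signal, frame_len=64, hop=32):
    """Return (first_order, second_order) as lists of per-frame bin values.

    first_order  = |STFT|     (magnitude spectrogram, assumed reading)
    second_order = |STFT|**2  (power spectrogram)
    """
    window = hann(frame_len)
    first, second = [], []
    for frame in frame_signal(signal, frame_len, hop):
        mags = dft_magnitudes([x * w for x, w in zip(frame, window)])
        first.append(mags)
        second.append([m * m for m in mags])
    return first, second


# Toy input: a pure tone that falls exactly on DFT bin 8 of a 64-sample frame.
tone = [math.sin(2 * math.pi * 8 * i / 64) for i in range(256)]
first_order, second_order = multi_order_spectrograms(tone)
```

Both outputs share the same time-frequency grid, which is what allows a network to fuse them element-wise or channel-wise. Squaring emphasises strong spectral peaks relative to weak ones, so the two orders present the same content with different dynamic ranges.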

Robust Audio Anti-Spoofing with Fusion-Reconstruction Learning on Multi-Order Spectrograms

End-to-end Spoofing Speech Detection and Knowledge Distillation under Noisy Conditions

Fast and Lightweight Voice Replay Attack Detection Via Time-frequency Spectrum Difference

Siamese Network with Wav2vec Feature for Spoofing Speech Detection

Robust Audio Anti-Spoofing System Based on Low-Frequency Sub-Band Information

Frame-to-Utterance Convergence: A Spectra-Temporal Approach for Unified Spoofing Detection

Multi-perspective Information Fusion Res2Net with RandomSpecmix for Fake Speech Detection

Toward Improving Synthetic Audio Spoofing Detection Robustness via Meta-Learning and Disentangled Training With Adversarial Examples

Self-Attention and Hybrid Features for Replay and Deep-Fake Audio Detection

Audio Anti-Spoofing Detection: A Survey

ConvNeXt Based Neural Network for Audio Anti-Spoofing

How to Boost Anti-Spoofing with X-Vectors.

Audio Deepfake Detection Based on a Combination of F0 Information and Real Plus Imaginary Spectrogram Features

STATNet: Spectral and Temporal features based Multi-Task Network for Audio Spoofing Detection

Voice spoofing detection with raw waveform based on Dual Path Res2net

Audio Deepfake Detection with Self-Supervised WavLM and Multi-Fusion Attentive Classifier

Temporal Variability and Multi-Viewed Self-Supervised Representations to Tackle the ASVspoof5 Deepfake Challenge

Spoofing-Aware Speaker Verification by Multi-Level Fusion

Speaker-Aware Anti-Spoofing

Audio Anti-spoofing Using a Simple Attention Module and Joint Optimization Based on Additive Angular Margin Loss and Meta-learning