Complex-valued neural networks for voice anti-spoofing

Nicolas M. Müller, Philip Sperl, Konstantin Böttinger

2023-08-23

Abstract: Current anti-spoofing and audio deepfake detection systems use either magnitude spectrogram-based features (such as CQT or mel spectrograms) or raw audio processed through convolution or sinc-layers. Both methods have drawbacks: magnitude spectrograms discard phase information, which affects audio naturalness, and raw-feature-based models cannot use traditional explainable AI methods. This paper proposes a new approach that combines the benefits of both methods by using complex-valued neural networks to process the complex-valued, CQT frequency-domain representation of the input audio. This method retains phase information and allows for explainable AI methods. Results show that this approach outperforms previous methods on the "In-the-Wild" anti-spoofing dataset and enables interpretation of the results through explainable AI. Ablation studies confirm that the model has learned to use phase information to detect voice spoofing.

Sound, Machine Learning, Audio and Speech Processing

What problem does this paper attempt to address?

The paper addresses two major shortcomings of current voice anti-spoofing and audio deepfake detection systems:

1. **Magnitude Spectrogram Method**: This method converts time-domain waveforms into magnitude spectrograms using the short-time Fourier transform (STFT) or similar techniques, which discards phase information. Phase is crucial for the naturalness of audio. In text-to-speech (TTS), for example, phase often has to be regenerated with methods such as Griffin-Lim or neural vocoders.

2. **Raw Audio Processing Method**: This method processes raw audio directly, extracting features through convolutional layers or sinc layers. Although it achieves good detection performance, it lacks transparency: existing explainable artificial intelligence (XAI) methods such as saliency maps and SmoothGrad require input with spatial dimensions (i.e., at least 2D input), whereas raw audio is a 1D vector.

To address these issues, the paper proposes using complex-valued neural networks (CVNNs) to process the complex-valued constant-Q transform (CQT) frequency-domain representation of the input audio. This representation retains phase information and, being two-dimensional, allows the use of explainable AI techniques. Experimental results show that the method outperforms existing magnitude spectrogram methods and raw feature methods on the "In-the-Wild" anti-spoofing dataset. Ablation studies further confirm that the model has learned to use phase information to detect voice spoofing.
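The difference between magnitude-only and complex-valued input can be illustrated with a minimal NumPy sketch. It is not the authors' model: it uses a plain Hann-windowed STFT as a stand-in for the CQT, and the helper names `stft` and `complex_linear`, the layer width, and the toy 440 Hz signal are all assumptions made for illustration. The sketch builds two spectrograms with identical magnitudes and different phases, then shows that a single complex-valued linear map (written out with real-valued weights) produces different outputs for them, whereas a magnitude-only front end could not tell them apart.

```python
import numpy as np

rng = np.random.default_rng(0)

def stft(signal, frame_len=256, hop=128):
    """Complex spectrogram from Hann-windowed frames (stand-in for the CQT)."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1).T  # shape: (freq_bins, n_frames)

def complex_linear(z, w_real, w_imag):
    """Hypothetical complex-valued linear map using real-valued weights.

    Implements (A + iB)(x + iy) = (Ax - By) + i(Bx + Ay).
    """
    x, y = z.real, z.imag
    return (w_real @ x - w_imag @ y) + 1j * (w_imag @ x + w_real @ y)

# Toy input (assumption): 0.1 s of a 440 Hz tone sampled at 16 kHz.
sr = 16000
t = np.arange(sr // 10) / sr
spec = stft(np.sin(2 * np.pi * 440 * t))

# Second spectrogram: same magnitudes, phases scrambled per bin.
random_phase = np.exp(1j * rng.uniform(-np.pi, np.pi, size=spec.shape))
spec_scrambled = spec * random_phase

# A magnitude-only front end sees identical inputs ...
same_magnitude = np.allclose(np.abs(spec), np.abs(spec_scrambled))
# ... although the complex-valued inputs differ.
same_complex = np.allclose(spec, spec_scrambled)

# A complex-valued layer maps the two inputs to different outputs.
w_real = rng.standard_normal((4, spec.shape[0]))
w_imag = rng.standard_normal((4, spec.shape[0]))
out = complex_linear(spec, w_real, w_imag)
out_scrambled = complex_linear(spec_scrambled, w_real, w_imag)
layer_sees_difference = not np.allclose(out, out_scrambled)

print(same_magnitude, same_complex, layer_sees_difference)
```

Because the complex spectrogram is a 2D array (frequency by time), attribution methods that expect image-like input can in principle be applied to it, which is the property the summary above attributes to the paper's approach.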

End-to-end Spoofing Speech Detection and Knowledge Distillation under Noisy Conditions

Voice Presentation Attack Detection Using Convolutional Neural Networks

ConvNeXt Based Neural Network for Audio Anti-Spoofing

Acoustic features analysis for explainable machine learning-based audio spoofing detection

Source Tracing of Audio Deepfake Systems

STATNet: Spectral and Temporal features based Multi-Task Network for Audio Spoofing Detection

Does Audio Deepfake Detection Generalize?

Deepfake Audio Detection Using Spectrogram-based Feature and Ensemble of Deep Learning Models

MelCochleaGram-DeepCNN: Sequentially Fused Spectrogram and the DeepCNN Classifiers-based Audio Spoof Detection System

AI-Synthesized Voice Detection Using Neural Vocoder Artifacts

Securing Voice Biometrics: One-Shot Learning Approach for Audio Deepfake Detection

Robust Audio Anti-Spoofing System Based on Low-Frequency Sub-Band Information

Voice Spoofing Countermeasure for Voice Replay Attacks Using Deep Learning

Adversarial Post-Processing of Voice Conversion Against Spoofing Detection

Voice spoofing detection using a neural networks assembly considering spectrograms and mel frequency cepstral coefficients

Self-Attention and Hybrid Features for Replay and Deep-Fake Audio Detection

A lightweight feature extraction technique for deepfake audio detection

Physiological-Physical Feature Fusion for Automatic Voice Spoofing Detection

Small-footprint convolutional neural network for spoofing detection

Audio Spoofing Verification using Deep Convolutional Neural Networks by Transfer Learning