Complex-valued neural networks for voice anti-spoofing

Nicolas M. Müller, Philip Sperl, Konstantin Böttinger
2023-08-23
Abstract: Current anti-spoofing and audio deepfake detection systems use either magnitude spectrogram-based features (such as CQT or Melspectrograms) or raw audio processed through convolution or sinc-layers. Both methods have drawbacks: magnitude spectrograms discard phase information, which affects audio naturalness, and raw-feature-based models cannot use traditional explainable AI methods. This paper proposes a new approach that combines the benefits of both methods by using complex-valued neural networks to process the complex-valued, CQT frequency-domain representation of the input audio. This method retains phase information and allows for explainable AI methods. Results show that this approach outperforms previous methods on the "In-the-Wild" anti-spoofing dataset and enables interpretation of the results through explainable AI. Ablation studies confirm that the model has learned to use phase information to detect voice spoofing.
Sound, Machine Learning, Audio and Speech Processing
What problem does this paper attempt to address?
The paper addresses two shortcomings of current voice anti-spoofing and audio deepfake detection systems:

1. **Magnitude Spectrogram Method**: This method converts time-domain waveforms into magnitude spectrograms using the short-time Fourier transform (STFT) or similar techniques, but it discards phase information. Phase matters for the naturalness of audio, which is why text-to-speech (TTS) systems that generate magnitude spectrograms must regenerate the phase with methods such as Griffin-Lim or neural vocoders.
2. **Raw Audio Processing Method**: This method processes raw audio directly, extracting features through convolutional layers or sinc layers. Although it achieves good detection performance, it lacks transparency: existing explainable artificial intelligence (XAI) methods such as saliency maps and SmoothGrad require input with spatial dimensions (at least 2D), whereas raw audio is a 1D vector.

To address both issues, the paper proposes using complex-valued neural networks (CVNNs) to process the complex-valued constant-Q transform (CQT) frequency-domain representation of the input audio. This representation retains phase information and, being two-dimensional, allows the use of explainable AI techniques. Experimental results show that the method outperforms existing magnitude-spectrogram and raw-feature methods on the "In-the-Wild" anti-spoofing dataset. Ablation studies confirm that the model has learned to use phase information to detect voice spoofing.
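The claim that magnitude features discard phase can be illustrated with a minimal sketch. It is not the paper's pipeline: for self-containment it uses a naive discrete Fourier transform over a single frame instead of the CQT, and the helper names `dft` and `tone` are hypothetical. Two tones that differ only in their starting phase produce the same magnitude spectrum, while the complex-valued bins still tell them apart.

```python
import cmath
import math


def dft(signal):
    """Naive discrete Fourier transform; returns a list of complex bins."""
    n = len(signal)
    return [
        sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(signal))
        for k in range(n)
    ]


def tone(n, cycles, phase):
    """Real cosine with `cycles` periods over `n` samples and a given start phase."""
    return [math.cos(2 * math.pi * cycles * t / n + phase) for t in range(n)]


N = 64       # frame length (assumed value, for illustration only)
CYCLES = 4   # the tone falls exactly on DFT bin 4

# Two signals that differ only in their starting phase.
a = tone(N, CYCLES, phase=0.0)
b = tone(N, CYCLES, phase=math.pi / 2)

spec_a, spec_b = dft(a), dft(b)

# Magnitude features: the absolute value of every bin.
mag_a = [abs(c) for c in spec_a]
mag_b = [abs(c) for c in spec_b]
max_mag_diff = max(abs(x - y) for x, y in zip(mag_a, mag_b))

# Complex features keep the angle of each bin as well.
phase_a = cmath.phase(spec_a[CYCLES])
phase_b = cmath.phase(spec_b[CYCLES])

print("largest magnitude difference:", max_mag_diff)
print("phase at the tone's bin:", phase_a, "vs", phase_b)
```

A model fed only `mag_a` and `mag_b` receives numerically indistinguishable inputs, whereas a model fed the complex bins can still see the quarter-period shift. That is the property a complex-valued network operating on a complex spectrogram is able to exploit.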