Abstract: In this paper, we propose a deep learning based system for the task of deepfake audio detection. In particular, the raw input audio is first transformed into various spectrograms using three transformation methods, namely Short-time Fourier Transform (STFT), Constant-Q Transform (CQT), and Wavelet Transform (WT), combined with different auditory-based filters: Mel, Gammatone, linear filters (LF), and discrete cosine transform (DCT). Given the spectrograms, we evaluate a wide range of classification models based on three deep learning approaches. The first approach is to train directly on the spectrograms using our proposed baseline models: a CNN-based model (CNN-baseline), an RNN-based model (RNN-baseline), and a C-RNN model (C-RNN baseline). The second approach is transfer learning from computer vision models such as ResNet-18, MobileNet-V3, EfficientNet-B0, DenseNet-121, ShuffleNet-V2, Swin-T, ConvNeXt-Tiny, GoogLeNet, MNASNet, and RegNet. In the third approach, we leverage the state-of-the-art pre-trained audio models Whisper, Seamless, SpeechBrain, and Pyannote to extract audio embeddings from the input spectrograms. The audio embeddings are then fed to a multilayer perceptron (MLP) model to classify audio samples as fake or real. Finally, high-performance deep learning models from these approaches are fused to achieve the best performance. We evaluated our proposed models on the ASVspoof 2019 benchmark dataset. Our best ensemble model achieved an Equal Error Rate (EER) of 0.03, which is highly competitive with top-performing systems in the ASVspoof 2019 challenge. Experimental results also highlight the potential of selective spectrograms and deep learning approaches to enhance the task of audio deepfake detection.
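
As a rough illustration of the two evaluation steps the abstract mentions, score fusion and the Equal Error Rate, the following sketch may help. It is not the authors' implementation: the abstract does not say how models are fused, so plain mean fusion is an assumption here, and the function names `equal_error_rate` and `fuse_mean` are hypothetical. Scores are assumed to be per-sample "fake" probabilities, with label 1 for fake and 0 for real.

```python
import numpy as np


def equal_error_rate(scores, labels):
    """Approximate the EER for scores where higher means 'more likely fake'.

    labels: 1 = fake (positive class), 0 = real (negative class).
    The EER is the error rate at the threshold where the false positive
    rate and the false negative rate are (closest to) equal.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    pos = scores[labels == 1]
    neg = scores[labels == 0]

    # Candidate thresholds: every observed score, plus one above the maximum
    # so that the false positive rate can reach zero.
    thresholds = np.unique(scores)
    thresholds = np.append(thresholds, thresholds[-1] + 1.0)

    # A sample is predicted fake when its score is >= threshold.
    fpr = np.array([(neg >= t).mean() for t in thresholds])  # real flagged as fake
    fnr = np.array([(pos < t).mean() for t in thresholds])   # fake missed

    idx = np.argmin(np.abs(fpr - fnr))
    return (fpr[idx] + fnr[idx]) / 2.0


def fuse_mean(score_lists):
    """Late fusion by averaging per-sample scores across models (an assumption)."""
    return np.mean(np.asarray(score_lists, dtype=float), axis=0)


if __name__ == "__main__":
    labels = [0, 0, 1, 1]
    model_a = [0.1, 0.6, 0.4, 0.9]  # hypothetical scores from one model
    model_b = [0.2, 0.3, 0.7, 0.8]  # hypothetical scores from another model
    fused = fuse_mean([model_a, model_b])
    print("EER model A:", equal_error_rate(model_a, labels))
    print("EER fused:  ", equal_error_rate(fused, labels))
```

On a real benchmark the threshold sweep would run over thousands of trials, and an interpolated ROC-based estimate may be preferable to the nearest-threshold approximation used above.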

Deepfake Audio Detection Using Spectrogram-based Feature and Ensemble of Deep Learning Models
