Abstract: Automatic Speaker Verification (ASV) systems have potential in biometric applications for logical access control and authentication, and much is at stake if an ASV system is compromised. This preliminary work presents a comparative analysis of the state-of-the-art wavelet-based and MFCC-based spoof detection techniques developed in (Novoselov et al., 2016) and (Alam et al., 2016a), respectively. The results on ASVspoof 2015 justify our preference for wavelet-based features over MFCC features. The experiments on the ASVspoof 2019 database show that traditional handcrafted features are not reliable there, which gives us further reason to move towards end-to-end (E2E) deep neural networks and more recent techniques. We use the SincNet architecture as our baseline. By replacing its Sinc layer with Wavelet Scattering and Continuous Wavelet Transform (CWT) layers, we obtain two E2E deep learning models, which we call WSTnet and CWTnet, respectively. The fusion model achieved 62% and 17% relative improvement over the traditional handcrafted models and our SincNet baseline, respectively, when evaluated on the modern spoofing attacks in ASVspoof 2019. However, the final scale distribution and the number of scales used in CWTnet are far from optimal for the task at hand. To address this, we replaced the CWT layer in the CWTnet architecture with a Wavelet Deconvolution (WD) layer (Khan and Yener, 2018). Like the CWT layer, this layer computes the discrete-continuous wavelet transform, but it also optimizes the scale parameter using back-propagation. The WDnet model achieved 26% and 7% relative improvement over the CWTnet and SincNet models, respectively, when evaluated on the ASVspoof 2019 dataset. This suggests that WDnet extracts more generalized features than CWTnet, since it focuses only on the most important and relevant frequency regions.
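To illustrate the idea of a wavelet front-end with an explicit scale parameter, the following is a minimal NumPy sketch, not the WDnet implementation. It assumes a Mexican-hat (Ricker) mother wavelet and a fixed list of scales, and the helper names `ricker_kernel` and `wavelet_layer` are hypothetical. In a WD-style layer the scales would be trainable parameters updated by back-propagation, which this sketch does not do.

```python
import numpy as np

def ricker_kernel(scale, half_width=4.0):
    """Sampled Mexican-hat (Ricker) wavelet at one scale, given in samples.

    Assumption: psi(t) = (1 - t^2) * exp(-t^2 / 2), dilated by `scale`
    and normalized by 1 / sqrt(scale). The support is truncated at
    +/- half_width * scale samples.
    """
    n = int(np.ceil(half_width * scale))
    t = np.arange(-n, n + 1) / scale
    return (1.0 - t**2) * np.exp(-0.5 * t**2) / np.sqrt(scale)

def wavelet_layer(signal, scales):
    """Filter a 1-D signal with one wavelet per scale.

    Returns an array of shape (len(scales), len(signal)), provided every
    kernel is shorter than the signal (np.convolve with mode="same").
    """
    rows = [np.convolve(signal, ricker_kernel(s), mode="same") for s in scales]
    return np.stack(rows)

# Toy input: a pure tone whose frequency is chosen relative to scale 8.
t = np.arange(2048)
x = np.sin(np.sqrt(2.5) / 8.0 * t)

# Hypothetical fixed scale set; a learnable layer would adapt these values.
scales = [2.0, 4.0, 8.0, 16.0, 32.0]
features = wavelet_layer(x, scales)

# Mean energy per scale shows which scale responds most to the tone.
energy = (features**2).mean(axis=1)
```

Each row of `features` is the signal filtered at one scale, so the choice of scales determines which frequency regions the front-end emphasizes. In an end-to-end model the same filtering would be written in an automatic-differentiation framework so that gradients with respect to each scale are available.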

Spoofed Voice Detection using Dense Features of STFT and MDCT Spectrograms

Voice Presentation Attack Detection Using Convolutional Neural Networks

End-to-end Spoofing Speech Detection and Knowledge Distillation under Noisy Conditions

STATNet: Spectral and Temporal features based Multi-Task Network for Audio Spoofing Detection

Securing Voice Biometrics: One-Shot Learning Approach for Audio Deepfake Detection

Frame-to-Utterance Convergence: A Spectra-Temporal Approach for Unified Spoofing Detection

MelCochleaGram-DeepCNN: Sequentially Fused Spectrogram and the DeepCNN Classifiers-based Audio Spoof Detection System

Voice Spoofing Countermeasure for Voice Replay Attacks Using Deep Learning

Audio Spoofing Verification using Deep Convolutional Neural Networks by Transfer Learning

Gaussian-Filtered High-Frequency-Feature Trained Optimized BiLSTM Network for Spoofed-Speech Classification

Self-Attention and Hybrid Features for Replay and Deep-Fake Audio Detection

A lightweight feature extraction technique for deepfake audio detection

Augmentation through Laundering Attacks for Audio Spoof Detection

Detection of Doctored Speech: Towards an End-to-End Parametric Learn-able Filter Approach

Audio Deepfake Detection by using Machine and Deep Learning

Investigating Causal Cues: Strengthening Spoofed Audio Detection with Human-Discernible Linguistic Features

An Audio Copy-Move Forgery Localization Model by CNN-Based Spectral Analysis

A blended framework for audio spoof detection with sequential models and bags of auditory bites

Voice spoofing detection using a neural networks assembly considering spectrograms and mel frequency cepstral coefficients

MFAAN: Unveiling Audio Deepfakes with a Multi-Feature Authenticity Network

Physiological-Physical Feature Fusion for Automatic Voice Spoofing Detection