Abstract: Automatic speaker verification (ASV) systems are highly vulnerable to presentation attacks, also called spoofing attacks. Replay is among the simplest attacks to mount, yet it is difficult to detect reliably. The generalization failure of spoofing countermeasures (CMs) has driven the community to study various alternative deep learning CMs. The majority of them are supervised approaches that learn a human-spoof discriminator. In this paper, we advocate a different, deep generative approach that leverages powerful unsupervised manifold learning for classification. The potential benefits include the possibility of sampling new data and of gaining insight into the latent features of genuine and spoofed speech. To this end, we propose to use variational autoencoders (VAEs) as an alternative backend for replay attack detection, via three alternative models that differ in their class-conditioning. The first, similar to the use of Gaussian mixture models (GMMs) in spoof detection, independently trains two VAEs, one for each class. The second trains a single conditional model (C-VAE) by injecting a one-hot class label vector into the encoder and decoder networks. Our final proposal integrates an auxiliary classifier to guide the learning of the latent space. Our experimental results using constant-Q cepstral coefficient (CQCC) features on the ASVspoof 2017 and 2019 physical access subtask datasets indicate that the C-VAE offers substantial improvement over training two separate VAEs, one for each class. On the 2019 dataset, the C-VAE outperforms the VAE and the baseline GMM by an absolute 9-10% in both equal error rate (EER) and tandem detection cost function (t-DCF) metrics. Finally, we propose VAE residuals, the absolute difference between the original input and its reconstruction, as features for spoofing detection. The proposed frontend approach, augmented with a convolutional neural network classifier, demonstrated substantial improvement over the VAE backend use case.
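Two operations named in the abstract are simple enough to sketch in a few lines: injecting a one-hot class label into a network input, and computing VAE residuals as the absolute difference between an input and its reconstruction. The following is a minimal NumPy sketch under stated assumptions, not the authors' implementation. The function names are hypothetical, the features are assumed to be laid out as frames by coefficients, and concatenation is assumed as the injection mechanism (the abstract does not say how the label vector is injected).

```python
import numpy as np


def one_hot(label: int, num_classes: int = 2) -> np.ndarray:
    """Return a one-hot vector for `label` (assumed: 0 = genuine, 1 = spoof)."""
    vec = np.zeros(num_classes, dtype=np.float32)
    vec[label] = 1.0
    return vec


def condition_on_class(x: np.ndarray, label: int, num_classes: int = 2) -> np.ndarray:
    """Append the same one-hot class vector to every frame (row) of `x`.

    Assumption: conditioning is done by concatenation along the feature axis.
    """
    labels = np.tile(one_hot(label, num_classes), (x.shape[0], 1))
    return np.concatenate([x, labels], axis=1)


def vae_residual(x: np.ndarray, x_hat: np.ndarray) -> np.ndarray:
    """Element-wise absolute difference between input and reconstruction."""
    return np.abs(x - x_hat)


if __name__ == "__main__":
    frames = np.ones((3, 4), dtype=np.float32)   # 3 frames, 4 coefficients
    conditioned = condition_on_class(frames, label=1)
    print(conditioned.shape)                     # one-hot adds 2 columns

    reconstruction = frames * 0.5                # stand-in for a decoder output
    print(vae_residual(frames, reconstruction))
```

In a conditional model the same label vector would typically be supplied to both the encoder input and the decoder's latent input. The residual array has the same shape as the input, so it can be passed to a downstream classifier in place of the original features.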

Joint Decision of Anti-Spoofing and Automatic Speaker Verification by Multi-Task Learning With Contrastive Loss

Multi-task learning of deep neural networks for joint automatic speaker verification and spoofing detection

VarASV: Enabling Pitch-variable Automatic Speaker Verification Via Multi-task Learning

Enhancing Out-of-Domain Detection for Speech Spoofing Countermeasure Via Supervised Contrastive Learning

Multi-task Learning Based Spoofing-Robust Automatic Speaker Verification System

Tackling Spoofing-Aware Speaker Verification with Multi-Model Fusion

Voice Presentation Attack Detection Using Convolutional Neural Networks

Spoofing-Aware Speaker Verification Robust Against Domain and Channel Mismatches

Deep generative variational autoencoding for replay spoof detection in automatic speaker verification

Generalizing Speaker Verification for Spoof Awareness in the Embedding Space

Toward Improving Synthetic Audio Spoofing Detection Robustness via Meta-Learning and Disentangled Training With Adversarial Examples

Audio Anti-spoofing Using a Simple Attention Module and Joint Optimization Based on Additive Angular Margin Loss and Meta-learning

Two Methods for Spoofing-Aware Speaker Verification: Multi-Layer Perceptron Score Fusion Model and Integrated Embedding Projector

Simultaneous Utilization of Spectral Magnitude and Phase Information to Extract Supervectors for Speaker Verification Anti-Spoofing

Spoofing-Aware Speaker Verification by Multi-Level Fusion

Cross-modal Mask Fusion and Modality-Balanced Audio-Visual Speech Recognition

Can spoofing countermeasure and speaker verification systems be jointly optimised?

An initial investigation on optimizing tandem speaker verification and countermeasure systems using reinforcement learning

Automatic speaker verification spoofing and deepfake detection using wav2vec 2.0 and data augmentation