Abstract: Automatic speaker verification (ASV) systems are highly vulnerable to presentation attacks, also called spoofing attacks. Replay is among the simplest attacks to mount, yet it is difficult to detect reliably. The generalization failure of spoofing countermeasures (CMs) has driven the community to study various alternative deep learning CMs. The majority of them are supervised approaches that learn a human-spoof discriminator. In this paper, we advocate a different, deep generative approach that leverages powerful unsupervised manifold learning for classification. The potential benefits include the possibility of sampling new data and of gaining insight into the latent features of genuine and spoofed speech. To this end, we propose to use variational autoencoders (VAEs) as an alternative backend for replay attack detection, via three alternative models that differ in their class-conditioning. The first one, similar to the use of Gaussian mixture models (GMMs) in spoof detection, is to independently train two VAEs, one for each class. The second one is to train a single conditional model (C-VAE) by injecting a one-hot class label vector into the encoder and decoder networks. Our final proposal integrates an auxiliary classifier to guide the learning of the latent space. Our experimental results using constant-Q cepstral coefficient (CQCC) features on the ASVspoof 2017 and 2019 physical access subtask datasets indicate that the C-VAE offers substantial improvement in comparison to training two separate VAEs, one per class. On the 2019 dataset, the C-VAE outperforms the VAE and the baseline GMM by an absolute 9-10% in both equal error rate (EER) and tandem detection cost function (t-DCF) metrics. Finally, we propose VAE residuals, the absolute difference between the original input and its reconstruction, as features for spoofing detection. The proposed frontend approach, augmented with a convolutional neural network classifier, demonstrated substantial improvement over the VAE backend use case.
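The two ideas the abstract states most concretely, one-hot class conditioning and VAE residual features, can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names are hypothetical, injecting the label by concatenating it to each feature frame is an assumption (the abstract only says the label is injected into the encoder and decoder), and the reconstruction would come from a trained VAE that is not shown here.

```python
import numpy as np

def one_hot(label, num_classes=2):
    """Return a one-hot vector for an integer class label (e.g. 0 = genuine, 1 = spoof)."""
    vec = np.zeros(num_classes, dtype=np.float32)
    vec[label] = 1.0
    return vec

def condition_on_class(features, label, num_classes=2):
    """Append the one-hot class vector to every frame (row) of `features`.

    Assumption: conditioning by concatenation; the abstract does not
    specify how the label is injected into the networks.
    """
    labels = np.tile(one_hot(label, num_classes), (features.shape[0], 1))
    return np.concatenate([features, labels], axis=1)

def vae_residual(original, reconstruction):
    """Element-wise absolute difference between an input and its reconstruction."""
    return np.abs(original - reconstruction)
```

In this sketch, `features` is a (frames x coefficients) array such as a CQCC matrix, and the output of `vae_residual` has the same shape as its input, so it could be passed to a downstream classifier in place of the original features.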

Deep generative variational autoencoding for replay spoof detection in automatic speaker verification
