Abstract: Automatic speaker verification (ASV) systems are highly vulnerable to presentation attacks, also called spoofing attacks. Replay is among the simplest attacks to mount, yet it is difficult to detect reliably. The generalization failure of spoofing countermeasures (CMs) has driven the community to study various alternative deep learning CMs. The majority of them are supervised approaches that learn a human-spoof discriminator. In this paper, we advocate a different, deep generative approach that leverages powerful unsupervised manifold learning for classification. The potential benefits include the possibility to sample new data and to obtain insights into the latent features of genuine and spoofed speech. To this end, we propose to use variational autoencoders (VAEs) as an alternative backend for replay attack detection, via three alternative models that differ in their class-conditioning. The first one, similar to the use of Gaussian mixture models (GMMs) in spoof detection, is to train two VAEs independently, one for each class. The second one is to train a single conditional model (C-VAE) by injecting a one-hot class label vector into the encoder and decoder networks. Our final proposal integrates an auxiliary classifier to guide the learning of the latent space. Our experimental results using constant-Q cepstral coefficient (CQCC) features on the ASVspoof 2017 and 2019 physical access subtask datasets indicate that the C-VAE offers substantial improvement in comparison to training a separate VAE for each class. On the 2019 dataset, the C-VAE outperforms the VAE and the baseline GMM by an absolute 9-10% in both equal error rate (EER) and tandem detection cost function (t-DCF) metrics. Finally, we propose VAE residuals, the absolute difference between the original input and its reconstruction, as features for spoofing detection. The proposed frontend approach, augmented with a convolutional neural network classifier, demonstrated substantial improvement over the VAE backend use case.
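The two mechanisms named in the abstract, class-conditioning through a one-hot label and the VAE residual feature, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the encoder and decoder here are untrained random linear maps standing in for trained networks, and the dimensions, the function names (`one_hot`, `encode`, `decode`, `reconstruct`, `vae_residual`), and the label used at reconstruction time are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes only; these are assumptions, not values from the paper.
FEAT_DIM, LATENT_DIM, NUM_CLASSES = 8, 2, 2

def one_hot(label, num_classes=NUM_CLASSES):
    """Return a one-hot vector for an integer class label."""
    v = np.zeros(num_classes)
    v[label] = 1.0
    return v

# Hypothetical stand-ins for trained networks: random linear maps.
# The encoder outputs a mean and a log-variance for each latent dimension.
W_enc = 0.1 * rng.normal(size=(2 * LATENT_DIM, FEAT_DIM + NUM_CLASSES))
W_dec = 0.1 * rng.normal(size=(FEAT_DIM, LATENT_DIM + NUM_CLASSES))

def encode(x, label):
    """Conditional encoder: the one-hot label is concatenated to the input."""
    h = W_enc @ np.concatenate([x, one_hot(label)])
    return h[:LATENT_DIM], h[LATENT_DIM:]  # (mu, log_var)

def decode(z, label):
    """Conditional decoder: the one-hot label is concatenated to the latent code."""
    return W_dec @ np.concatenate([z, one_hot(label)])

def reconstruct(x, label):
    """Encode, sample a latent code with the reparameterization trick, decode."""
    mu, log_var = encode(x, label)
    eps = rng.standard_normal(LATENT_DIM)
    z = mu + np.exp(0.5 * log_var) * eps
    return decode(z, label)

def vae_residual(x, x_hat):
    """Residual feature: element-wise absolute difference of input and reconstruction."""
    return np.abs(x - x_hat)

x = rng.normal(size=FEAT_DIM)      # stand-in for one feature vector (e.g. a CQCC frame)
x_hat = reconstruct(x, label=0)    # label choice is arbitrary in this sketch
residual = vae_residual(x, x_hat)
```

In the frontend use case described above, residual vectors of this kind would be passed to a separate classifier such as a convolutional neural network; how the residuals are assembled into classifier input is not specified in the abstract and is left out of the sketch.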

Replay detection using CQT-based modified group delay feature and ResNeWt network in ASVspoof 2019

A multi-branch ResNet with discriminative features for detection of replay speech signals

Fast and Lightweight Voice Replay Attack Detection Via Time-frequency Spectrum Difference

Siamese Network with Wav2vec Feature for Spoofing Speech Detection

An Experimental Study on Replay Attack Detection Using Spoofing Clues from both Voiced and Non-Voiced Segments

Replay Attack Detection Using Integrated Glottal Excitation Based Group Delay Function and Cepstral Features

Two-Path GMM-ResNet and GMM-SENet for ASV Spoofing Detection

Deep generative variational autoencoding for replay spoof detection in automatic speaker verification

Replay and Synthetic Speech Detection with Res2net Architecture

Detection of Replay-Spoofing Attacks Using Frequency Modulation Features

Teager Energy Operator Based Features with x-vector for Replay Attack Detection

Cross-database replay detection in terminal-dependent speaker verification

Voice Presentation Attack Detection Using Convolutional Neural Networks

Speech Replay Detection with x-Vector Attack Embeddings and Spectral Features

Transforming acoustic characteristics to deceive playback spoofing countermeasures of speaker verification systems

Replay attack detection using variable-frequency resolution phase and magnitude features

A Study on Replay Attack and Anti-Spoofing for Automatic Speaker Verification

USTC-KXDIGIT System Description for ASVspoof5 Challenge

Audio compression-assisted feature extraction for voice replay attack detection

Channel-wise Gated Res2Net: Towards Robust Detection of Synthetic Speech Attacks

Voice Spoofing Countermeasure for Voice Replay Attacks Using Deep Learning