Abstract: Automatic speaker verification systems are vulnerable to various spoofing attacks. A 2-class Gaussian Mixture Model (GMM) classifier for genuine and spoofed speech is usually used as the baseline for spoofing detection. However, the GMM classifier does not separately consider the scores of feature frames on each Gaussian component. In addition, the GMM accumulates the scores on all frames independently and does not consider their correlations. We propose the two-path GMM-ResNet and GMM-SENet models for spoofing detection, whose input is the Gaussian probability features based on two GMMs trained on genuine and spoofed speech respectively. The models consider not only the score distribution on GMM components, but also the relationship between adjacent frames. A two-step training scheme is applied to improve the system robustness. Experiments on the ASVspoof 2019 database show that, compared with the GMM, the LFCC+GMM-ResNet system relatively reduces min-tDCF and EER by 76.1% and 76.3% in the logical access scenario, and the LFCC+GMM-SENet system by 94.4% and 95.4% in the physical access scenario. After score fusion, the systems give the second-best results in both scenarios.

What problem does this paper attempt to address?

This paper addresses the vulnerability of automatic speaker verification (ASV) systems to spoofing attacks. A two-class Gaussian Mixture Model (GMM) classifier, with one model for genuine speech and one for spoofed speech, is the usual baseline for spoofing detection. However, the GMM classifier has two main limitations:

1. **It does not consider the frame scores on each Gaussian component separately**: When computing the final score, the GMM classifier discards the information about how the score is distributed across the individual Gaussian components.
2. **It ignores the correlation between frames**: The GMM classifier accumulates the scores of all frames independently, without considering the relationship between adjacent frames.

To overcome these limitations, the paper proposes two models: the two-path GMM-ResNet and GMM-SENet. These models consider not only the score distribution on the GMM components but also the relationship between adjacent frames. A two-step training scheme further improves the robustness of the system. The main contributions of the paper are:

- **Two-path architecture**: Two GMMs (one trained on genuine speech, the other on spoofed speech) are used to extract Gaussian probability features, which are fed into two identical ResNet or SENet models. The embedding vectors of the two paths are then concatenated and passed to a fully connected layer for spoofing detection.
- **Two-step training scheme**: First, the convolutional layers and residual blocks are trained; the temporary fully connected layer is then removed and the overall architecture is trained. This helps to avoid overfitting and improves the generalization ability of the model.

The experimental results show that the proposed models significantly outperform the GMM baseline on the ASVspoof 2019 database in both the logical access (LA) and physical access (PA) scenarios. Relative to the GMM baseline, the LFCC+GMM-ResNet (2P2S) system reduces min-tDCF and EER by 76.1% and 76.3% respectively in the LA scenario, and the LFCC+GMM-SENet system reduces them by 94.4% and 95.4% respectively in the PA scenario.
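To make the two limitations concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes diagonal-covariance GMMs and takes the "Gaussian probability feature" of a frame to be its weighted log-likelihood on each component; the paper's exact definition and any normalization may differ. The function names, the toy sizes, and the random GMM parameters are all hypothetical.

```python
import numpy as np


def gaussian_prob_features(frames, weights, means, variances):
    """Per-frame, per-component weighted log-likelihoods of a diagonal GMM.

    frames: (T, D); weights: (K,); means, variances: (K, D). Returns (T, K).
    Assumed feature definition: log(w_k * N(x_t | mu_k, sigma_k^2)).
    """
    diff = frames[:, None, :] - means[None, :, :]                 # (T, K, D)
    mahal = np.sum(diff ** 2 / variances[None, :, :], axis=2)     # (T, K)
    dim = means.shape[1]
    log_norm = -0.5 * (dim * np.log(2 * np.pi) + np.sum(np.log(variances), axis=1))
    return np.log(weights)[None, :] + log_norm[None, :] - 0.5 * mahal


def baseline_gmm_score(feat_genuine, feat_spoofed):
    """Baseline 2-class GMM score: difference of average frame log-likelihoods."""
    def avg_loglik(feat):
        # log-sum-exp over components, then a plain mean over frames
        peak = feat.max(axis=1, keepdims=True)
        frame_ll = peak[:, 0] + np.log(np.exp(feat - peak).sum(axis=1))
        return float(frame_ll.mean())
    return avg_loglik(feat_genuine) - avg_loglik(feat_spoofed)


rng = np.random.default_rng(0)
T, D, K = 50, 20, 8  # frames, feature dimension, GMM components (toy sizes)


def random_gmm():
    """Hypothetical stand-in for a GMM trained on genuine or spoofed speech."""
    w = rng.random(K)
    return w / w.sum(), rng.normal(size=(K, D)), rng.random((K, D)) + 0.5


frames = rng.normal(size=(T, D))  # stand-in for the LFCC frames of one utterance
feat_genuine = gaussian_prob_features(frames, *random_gmm())  # input to path 1
feat_spoofed = gaussian_prob_features(frames, *random_gmm())  # input to path 2
score = baseline_gmm_score(feat_genuine, feat_spoofed)
print(feat_genuine.shape, feat_spoofed.shape)  # (50, 8) (50, 8)
```

In this sketch the baseline reduces each `(T, K)` map to a single number: the sum over components hides the per-component score distribution, and the mean over frames is unchanged if the frames are shuffled. The two-path models described above instead take the two maps as network inputs, so both the component axis and the frame order remain available to the convolutional layers.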

Two-Path GMM-ResNet and GMM-SENet for ASV Spoofing Detection
