Physiological-Physical Feature Fusion for Automatic Voice Spoofing Detection

Junxiao Xue, Hao Zhou, Yabo Wang

DOI: https://doi.org/10.48550/arXiv.2109.00913

2021-09-01

Abstract: Speaker verification systems have been used in many production scenarios in recent years. Unfortunately, they are still highly prone to different kinds of spoofing attacks, such as voice conversion and speech synthesis. In this paper, we propose a new method based on physiological-physical feature fusion to deal with voice spoofing attacks. This method involves feature extraction, a densely connected convolutional neural network with squeeze-and-excitation blocks (SE-DenseNet), a multi-scale residual neural network with squeeze-and-excitation blocks (SE-Res2Net), and feature fusion strategies. We first pre-train a convolutional neural network using the speaker's voice and face in video as supervision signals, so that it can extract physiological features from speech. Then we use SE-DenseNet and SE-Res2Net to extract physical features. The dense connection pattern has high parameter efficiency, and the squeeze-and-excitation block enhances the transmission of features. Finally, we feed the two kinds of features into the SE-DenseNet to identify spoofing attacks. Experimental results on the ASVspoof 2019 data set show that our model is effective for voice spoofing detection. In the logical access scenario, our model improves the tandem decision cost function (t-DCF) and equal error rate (EER) scores by 4% and 7%, respectively, compared with other methods. In the physical access scenario, our model improves the t-DCF and EER scores by 8% and 10%, respectively.
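The equal error rate (EER) named in the abstract is the operating point at which the false acceptance rate (spoofed audio accepted) equals the false rejection rate (bona fide audio rejected). The following is a minimal sketch of that idea in plain Python. It is a discrete approximation over the observed scores, not the official ASVspoof scoring tool. The function name `equal_error_rate` and the toy scores are assumptions made for illustration, and higher scores are assumed to mean "more likely bona fide".

```python
def equal_error_rate(bonafide_scores, spoof_scores):
    """Approximate the EER by scanning every observed score as a threshold.

    Assumption: a trial is accepted when its score is >= the threshold.
    Returns (eer, threshold) at the point where FAR and FRR are closest.
    """
    thresholds = sorted(set(bonafide_scores) | set(spoof_scores))
    best = None
    for t in thresholds:
        # False rejection rate: bona fide trials scored below the threshold.
        frr = sum(s < t for s in bonafide_scores) / len(bonafide_scores)
        # False acceptance rate: spoofed trials scored at or above the threshold.
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2, t)
    return best[1], best[2]


# Toy scores (hypothetical, not taken from the paper).
bonafide = [0.9, 0.8, 0.7, 0.35]
spoof = [0.1, 0.2, 0.3, 0.4]
eer, threshold = equal_error_rate(bonafide, spoof)
print(f"EER = {eer:.2f} at threshold {threshold}")
```

In this toy case one bona fide trial scores below one spoofed trial, so no threshold separates the two sets and the EER is nonzero. The t-DCF additionally weights these errors by the costs and priors of the combined ASV system, which this sketch does not model.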

Audio and Speech Processing, Sound, Image and Video Processing

What problem does this paper attempt to address?

This paper addresses the spoofing attacks faced by Automatic Speaker Verification (ASV) systems, such as voice conversion (VC), text-to-speech synthesis (TTS), and replay attacks. These attacks can generate audio very close to real voices, posing a serious threat to existing ASV systems, especially in the Logical Access (LA) and Physical Access (PA) scenarios. To improve the security of ASV systems against these attacks, the paper proposes a new method based on the fusion of physiological and physical features to detect voice spoofing. Specifically, it makes the following contributions:

1. A new convolutional network for audio spoofing detection is designed. This network combines dense connections with squeeze-and-excitation blocks, which improves parameter efficiency and enhances feature transmission.

2. Physiological features and physical features are combined for audio spoofing detection. Facial features and voice features are extracted through the designed network, and the two are fused. Experiments show that the fused feature improves the performance of the model.

3. The developed network model outperforms existing methods on the ASVspoof 2019 challenge. For example, in the logical access scenario, compared with the state-of-the-art method, the model improves the tandem decision cost function (t-DCF) and the equal error rate (EER) by 28% and 11%, respectively.

Through these contributions, the paper aims to provide a more robust defense against unknown spoofing attacks, thereby enhancing the security and reliability of ASV systems.
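The squeeze-and-excitation mechanism mentioned in the first contribution can be sketched in a few lines. This is a simplified illustration in plain Python, not the authors' SE-DenseNet. The function name `se_block`, the weight matrices `w1` and `w2`, and the toy input are hypothetical, and biases are omitted for brevity. The block averages each channel into one number (squeeze), passes those numbers through two small fully connected layers to get a gate in (0, 1) per channel (excitation), and rescales each channel by its gate.

```python
import math


def se_block(feature_map, w1, w2):
    """Simplified squeeze-and-excitation over a C x H x W nested list.

    w1: (C // r) x C weights of the bottleneck layer (hypothetical values).
    w2: C x (C // r) weights of the expansion layer (hypothetical values).
    Returns (rescaled_feature_map, gates).
    """
    # Squeeze: global average pooling, one scalar per channel.
    squeezed = [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in feature_map
    ]
    # Excitation, step 1: bottleneck fully connected layer followed by ReLU.
    hidden = [max(0.0, sum(w * z for w, z in zip(row, squeezed))) for row in w1]
    # Excitation, step 2: expansion layer followed by a sigmoid gate per channel.
    gates = [
        1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
        for row in w2
    ]
    # Scale: multiply every value in a channel by that channel's gate.
    scaled = [
        [[gate * value for value in row] for row in channel]
        for channel, gate in zip(feature_map, gates)
    ]
    return scaled, gates


# Toy input: 2 channels of size 2 x 2, reduction ratio r = 2 (all hypothetical).
features = [
    [[1.0, 1.0], [1.0, 1.0]],
    [[3.0, 3.0], [3.0, 3.0]],
]
w1 = [[1.0, 1.0]]       # 1 x 2 bottleneck
w2 = [[1.0], [-1.0]]    # 2 x 1 expansion
scaled, gates = se_block(features, w1, w2)
print(gates)
```

With these toy weights the first channel is emphasized and the second is suppressed. In a trained network the weights are learned, so the gates reflect which channels are informative for the input at hand.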
