Abstract: Voice spoofing is an attempt to break into an automatic speaker verification (ASV) system by forging a user's voice, and can be carried out through text-to-speech (TTS), voice conversion (VC), or replay attacks. Deep learning-based voice spoofing countermeasures have recently been developed. For replay attacks, however, it is difficult to construct large datasets because each sample requires a physical recording process. To overcome this problem, this study proposes a pre-training framework based on multi-order acoustic simulation for replay voice spoofing detection. Multi-order acoustic simulation uses existing clean-signal and room impulse response (RIR) datasets to generate audio that simulates the various acoustic configurations of original and replayed audio. Here, the acoustic configuration refers to factors such as microphone type, reverberation, time delay, and noise that may arise between a speaker and a microphone during recording. We assume that a deep learning model trained on audio simulating these configurations can learn to distinguish the acoustic configuration of original audio from that of replayed audio. To test this assumption, we pre-trained a model to classify the audio generated by multi-order acoustic simulation into three classes: clean signal, audio simulating the acoustic configuration of original audio, and audio simulating the acoustic configuration of replayed audio. We then used the weights of the pre-trained model as the initial weights of the replay voice spoofing detection model and fine-tuned it on an existing replay voice spoofing dataset. To validate the effectiveness of the proposed method, we compared the conventional method without pre-training against the proposed method using two objective metrics, accuracy and F1-score. The conventional method achieved an accuracy of 92.94% and an F1-score of 86.92%, whereas the proposed method achieved an accuracy of 98.16% and an F1-score of 95.08%.
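
The three-class pre-training data described in the abstract can be illustrated with a minimal sketch. It rests on assumptions that the abstract does not confirm: "order" is read as the number of successive recording passes (0 for the clean signal, 1 for an original recording, 2 for a replayed recording), each pass is modeled only as convolution with an RIR plus additive noise, and a synthetic decaying-noise RIR stands in for a measured RIR dataset. Microphone type and time delay are not modeled. All function names are hypothetical, and only the Python standard library is used to keep the example self-contained.

```python
import math
import random


def synthetic_rir(rng, length=64, decay=12.0):
    """Hypothetical stand-in for a measured RIR: a direct path followed by
    exponentially decaying random reflections."""
    rir = [rng.gauss(0.0, 0.3) * math.exp(-n / decay) for n in range(length)]
    rir[0] = 1.0  # direct sound
    return rir


def convolve(signal, rir):
    """Direct-form convolution, truncated to the input length."""
    out = [0.0] * len(signal)
    for n in range(len(signal)):
        acc = 0.0
        for k in range(min(n + 1, len(rir))):
            acc += rir[k] * signal[n - k]
        out[n] = acc
    return out


def record_once(signal, rng, snr_db=30.0):
    """Assumed model of one recording pass: reverberation plus noise,
    followed by peak normalization."""
    wet = convolve(signal, synthetic_rir(rng))
    power = sum(x * x for x in wet) / len(wet)
    noise_std = math.sqrt(power / (10 ** (snr_db / 10)))
    noisy = [x + rng.gauss(0.0, noise_std) for x in wet]
    peak = max(abs(x) for x in noisy) or 1.0
    return [x / peak for x in noisy]


def simulate(clean, order, rng):
    """Apply `order` successive recording passes to a clean signal."""
    out = list(clean)
    for _ in range(order):
        out = record_once(out, rng)
    return out


def make_pretraining_examples(clean, rng):
    """Return (audio, label) pairs with labels 0 = clean,
    1 = simulated original, 2 = simulated replay."""
    return [(simulate(clean, order, rng), order) for order in (0, 1, 2)]


rng = random.Random(0)
# Short 220 Hz tone at an assumed 8 kHz sampling rate, used as the clean signal.
clean = [math.sin(2 * math.pi * 220 * n / 8000) for n in range(800)]
examples = make_pretraining_examples(clean, rng)
```

In practice the clean signals and RIRs would come from existing datasets, and an FFT-based convolution would replace the direct-form loop for realistic signal lengths.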

A Pre-Training Framework Based on Multi-Order Acoustic Simulation for Replay Voice Spoofing Detection
