Abstract: Self-supervised learning (SSL) speech representation models, trained on large speech corpora, have demonstrated effectiveness in extracting hierarchical speech embeddings through multiple transformer layers. However, the behavior of these embeddings in specific tasks remains uncertain. This paper investigates the multi-layer behavior of the WavLM model in anti-spoofing and proposes an attentive merging method to leverage the hierarchical hidden embeddings. Results demonstrate the feasibility of fine-tuning WavLM to achieve the best equal error rate (EER) of 0.65%, 3.50%, and 3.19% on the ASVspoof 2019LA, 2021LA, and 2021DF evaluation sets, respectively. Notably, we find that the early hidden transformer layers of the WavLM large model contribute significantly to the anti-spoofing task, enabling computational efficiency by utilizing a partial pre-trained model.
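Since the results above are reported as equal error rate (EER), a small illustration of the metric may help. The following is a minimal sketch, not the official ASVspoof scoring toolkit: the function name `equal_error_rate` and the toy scores are hypothetical, and it assumes that higher scores mean "more likely bona fide". It sweeps every observed score as a threshold and returns the operating point where the false acceptance rate (spoof accepted) and false rejection rate (bona fide rejected) are closest.

```python
def equal_error_rate(bonafide_scores, spoof_scores):
    """Approximate EER by sweeping thresholds over all observed scores.

    Assumption: a trial is accepted as bona fide when score >= threshold.
    Returns a fraction in [0, 1] (multiply by 100 for a percentage).
    """
    thresholds = sorted(set(bonafide_scores) | set(spoof_scores))
    best_gap = None
    eer = None
    for thr in thresholds:
        # False acceptance: spoofed trials scored at or above the threshold.
        far = sum(s >= thr for s in spoof_scores) / len(spoof_scores)
        # False rejection: bona fide trials scored below the threshold.
        frr = sum(s < thr for s in bonafide_scores) / len(bonafide_scores)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap = gap
            eer = (far + frr) / 2.0
    return eer


# Toy example: one spoofed trial (0.65) scores above one bona fide trial (0.6).
bonafide = [0.9, 0.8, 0.7, 0.6]
spoof = [0.1, 0.2, 0.3, 0.65]
print(equal_error_rate(bonafide, spoof))
```

With real evaluation sets the sweep would run over many thousands of trials, and published numbers should come from the benchmark's own scoring scripts rather than from a sketch like this one.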

What problem does this paper attempt to address?

This paper addresses audio anti-spoofing detection, and in particular how to use the pre-trained speech model WavLM effectively for this task. Its core contribution is an Attentive Merging (AttM) method that merges the multi-layer hidden embeddings of the WavLM model to improve anti-spoofing performance. The paper examines three key questions:

1. **The effectiveness of self-supervised learning (SSL) models in anti-spoofing tasks**: How well do SSL models, particularly WavLM, perform in anti-spoofing, and can they reliably identify spoofed audio?
2. **Which hidden layers are most helpful for anti-spoofing tasks**: How important are the features extracted from different layers of WavLM for distinguishing genuine from fake speech?
3. **Whether better performance can be achieved by using only part of the pre-trained model's layers**: Can computational requirements be reduced by using only some of the model's layers while maintaining or improving detection accuracy?

To answer these questions, the authors propose an attentive merging framework that extracts and merges relevant information from multiple hidden layers of the WavLM model. Experimental results show that focusing on the early hidden layers of the model (approximately up to the 12th layer) improves anti-spoofing performance and is more efficient than using the entire model. Additionally, by combining the merged embeddings with different classifiers (such as LSTM and ECAPA-TDNN), the paper demonstrates strong performance on several evaluation sets, including the ASVspoof 2019 and 2021 datasets.

In summary, the study clarifies which WavLM layers matter most for speech anti-spoofing and offers a merging method that exploits them with a partial pre-trained model.
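The idea of merging hidden embeddings from several layers can be illustrated with a short sketch. This is not the paper's AttM module, whose exact architecture is not described here; it is a generic softmax-weighted layer merge under stated assumptions. The function name `attentive_merge`, the scoring vector `score_weights` (standing in for learned parameters), and the `num_layers` argument (standing in for truncating the model to its early layers) are all hypothetical. Plain Python lists are used so the example runs without any dependencies.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attentive_merge(hidden, score_weights, num_layers=None):
    """Merge per-layer embeddings with attention weights over layers.

    hidden:        nested list shaped [layers][frames][dims]
    score_weights: hypothetical learned vector of length `dims`
    num_layers:    if given, keep only the first `num_layers` layers
                   (mimics using a partial pre-trained model)
    Returns (merged [frames][dims], attention weights per layer).
    """
    layers = hidden if num_layers is None else hidden[:num_layers]
    n_frames = len(layers[0])
    n_dims = len(layers[0][0])

    # Score each layer from its time-averaged embedding.
    scores = []
    for layer in layers:
        mean = [sum(frame[d] for frame in layer) / n_frames
                for d in range(n_dims)]
        scores.append(sum(w * x for w, x in zip(score_weights, mean)))

    attn = softmax(scores)

    # Weighted sum of the layers, frame by frame.
    merged = [
        [sum(a * layer[t][d] for a, layer in zip(attn, layers))
         for d in range(n_dims)]
        for t in range(n_frames)
    ]
    return merged, attn


# Toy input: 3 layers, 2 frames, 2 dims; layer k is filled with the value k+1.
hidden = [[[float(k + 1)] * 2 for _ in range(2)] for k in range(3)]
merged, attn = attentive_merge(hidden, score_weights=[0.0, 0.0])
print(attn)
print(merged)
```

In this toy call the zero scoring vector gives every layer the same weight, so the merge reduces to a plain average; passing `num_layers=2` drops the last layer before merging. In a trained system the weights would be learned jointly with the downstream classifier.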

Attentive Merging of Hidden Embeddings from Pre-trained Speech Model for Anti-spoofing Detection
