Attentive Merging of Hidden Embeddings from Pre-trained Speech Model for Anti-spoofing Detection

Zihan Pan, Tianchi Liu, Hardik B. Sailor, Qiongqiong Wang
2024-06-12
Abstract: Self-supervised learning (SSL) speech representation models, trained on large speech corpora, have demonstrated effectiveness in extracting hierarchical speech embeddings through multiple transformer layers. However, the behavior of these embeddings in specific tasks remains uncertain. This paper investigates the multi-layer behavior of the WavLM model in anti-spoofing and proposes an attentive merging method to leverage the hierarchical hidden embeddings. Results demonstrate the feasibility of fine-tuning WavLM to achieve the best equal error rate (EER) of 0.65%, 3.50%, and 3.19% on the ASVspoof 2019LA, 2021LA, and 2021DF evaluation sets, respectively. Notably, we find that the early hidden transformer layers of the WavLM large model contribute significantly to the anti-spoofing task, enabling computational efficiency by utilizing a partial pre-trained model.
Computation and Language, Sound, Audio and Speech Processing
What problem does this paper attempt to address?
This paper addresses audio anti-spoofing detection. It investigates how to use the pre-trained speech model WavLM effectively for this task. Its core contribution is an Attentive Merging (AttM) method, which merges the multi-layer hidden embeddings of the WavLM model to improve anti-spoofing performance. The paper examines three key questions:

1. **The effectiveness of self-supervised learning (SSL) models in anti-spoofing tasks**: How well do SSL models, particularly WavLM, perform in anti-spoofing, and can they reliably identify fraudulent audio?
2. **Which hidden layers are most helpful for anti-spoofing tasks**: How important are the features extracted from different layers of WavLM for distinguishing genuine from fake speech?
3. **Whether better performance can be achieved by using only part of the pre-trained model's layers**: Can a subset of the layers reduce computational requirements while maintaining or improving detection accuracy?

To answer these questions, the authors propose an attentive merging framework that extracts and merges relevant information from multiple hidden layers of WavLM. Experimental results show that focusing on the early hidden layers (approximately up to the 12th layer) improves anti-spoofing performance and is more efficient than using the entire model. By combining the merged embeddings with different classifiers (such as LSTM and ECAPA-TDNN), the paper also demonstrates strong performance on several evaluation sets, including the ASVspoof 2019 and 2021 datasets.
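The summary above does not spell out the internals of AttM, so the following is only a minimal sketch of one generic way to merge layer outputs: a softmax-weighted sum over layers. The function names, the `score_vector` parameter, and the time-average pooling used to score each layer are assumptions made for illustration, not the authors' implementation. The `num_layers` argument mimics using only the early part of the pre-trained model.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mean_pool(layer):
    """Average a (T x D) layer output over time, giving a D-dim summary."""
    num_frames, dim = len(layer), len(layer[0])
    return [sum(frame[d] for frame in layer) / num_frames for d in range(dim)]


def attentive_merge(layers, score_vector, num_layers=None):
    """Merge per-layer embeddings with softmax attention over layers.

    layers:       list of L layer outputs, each a list of T frames of D floats.
    score_vector: hypothetical D-dim vector (learned in a real system)
                  that scores each layer's pooled summary.
    num_layers:   if given, only the first `num_layers` layers are merged.
    """
    used = layers if num_layers is None else layers[:num_layers]
    # One scalar score per layer: dot product of pooled summary and score_vector.
    scores = [
        sum(w * x for w, x in zip(score_vector, mean_pool(layer)))
        for layer in used
    ]
    weights = softmax(scores)
    num_frames, dim = len(used[0]), len(used[0][0])
    # Weighted sum of the layer outputs, frame by frame.
    merged = [
        [sum(weights[i] * used[i][t][d] for i in range(len(used)))
         for d in range(dim)]
        for t in range(num_frames)
    ]
    return merged, weights


# Toy input: 3 "layers", 2 frames, 2 dimensions each.
layers = [
    [[1.0, 0.0], [1.0, 0.0]],
    [[0.0, 1.0], [0.0, 1.0]],
    [[5.0, 5.0], [5.0, 5.0]],
]
score_vector = [0.0, 0.0]  # zero scores give uniform attention weights

# Merge only the first two layers, as a stand-in for a partial model.
merged, weights = attentive_merge(layers, score_vector, num_layers=2)
print(weights)
print(merged)
```

In a trained system the attention weights would be learned jointly with the classifier, and inspecting them is one way to see which layers the model relies on.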
In summary, the study shows that attentively merging the early hidden layers of WavLM yields strong anti-spoofing performance at lower computational cost than using the full pre-trained model.
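The results quoted in the abstract are reported as equal error rate (EER), the operating point where the false acceptance rate equals the false rejection rate. As a rough illustration, the sketch below approximates EER by sweeping a threshold over the observed scores. The function name `compute_eer` and the convention that a higher score means "more likely genuine" are assumptions; official ASVspoof evaluations use their own scoring tools.

```python
def compute_eer(genuine_scores, spoof_scores):
    """Approximate the equal error rate by sweeping observed thresholds.

    Assumes a higher score means the trial is more likely genuine.
    Returns the mean of FAR and FRR at the threshold where they are closest.
    """
    thresholds = sorted(set(genuine_scores) | set(spoof_scores))
    best_gap, eer = float("inf"), None
    for th in thresholds:
        # False rejection: genuine speech scored below the threshold.
        frr = sum(s < th for s in genuine_scores) / len(genuine_scores)
        # False acceptance: spoofed speech scored at or above the threshold.
        far = sum(s >= th for s in spoof_scores) / len(spoof_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Toy scores: one genuine trial and one spoofed trial are misranked.
genuine = [0.9, 0.8, 0.7, 0.3]
spoof = [0.1, 0.2, 0.4, 0.6]
print(compute_eer(genuine, spoof))  # 0.25
```

An EER of 0.65% therefore means that, at the balanced threshold, fewer than one in a hundred trials of either kind is misclassified.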