Enhancing Anti-spoofing Countermeasures Robustness through Joint Optimization and Transfer Learning

Yikang Wang, Xingming Wang, Hiromitsu Nishizaki, Ming Li
2024-07-29
Abstract: Current research in synthesized speech detection primarily focuses on the generalization of detection systems to unknown spoofing methods of noise-free speech. However, anti-spoofing countermeasure (CM) systems often perform worse in more challenging scenarios, such as those involving noise and reverberation. To enhance the robustness of CM systems, we propose a transfer learning-based speech enhancement front-end joint optimization (TL-SEJ) method and investigate its effectiveness in improving robustness against noise and reverberation. We evaluated the proposed method's performance through a series of comparative and ablation experiments. The experimental results show that, across different signal-to-noise ratio test conditions, the proposed TL-SEJ method improves recognition accuracy by 2.7% to 15.8% compared to the baseline. Compared to conventional data augmentation methods, our system achieves an accuracy improvement ranging from 0.7% to 5.8% in various noisy conditions and from 1.7% to 2.8% under different RT60 reverberation scenarios. These experiments demonstrate that the proposed method effectively enhances system robustness in noisy and reverberant conditions.
Sound, Audio and Speech Processing, Signal Processing
What problem does this paper attempt to address?
This paper aims to make anti-spoofing countermeasures more robust in noisy and reverberant environments. Current anti-spoofing systems perform well on clean speech, but their performance declines significantly in more complex scenarios, such as in the presence of noise and reverberation. To address this, the authors propose a speech enhancement front-end method based on transfer learning and joint optimization (TL-SEJ).

### Background of the Paper

With the progress of deep learning, speech synthesis technologies such as voice conversion (VC) and text-to-speech (TTS) can now generate high-quality, natural, and expressive human voices. The potential abuse of these technologies poses a serious threat to automatic speaker verification (ASV) systems and may endanger social security, political stability, and economic integrity. Developing effective countermeasure (CM) systems is therefore crucial.

### Current Challenges

Current research mainly focuses on how well anti-spoofing detection systems generalize on clean speech. In complex environments with noise and reverberation, the performance of these systems is often unsatisfactory. To meet this challenge, researchers have tried a variety of methods, including data augmentation, feature extraction optimization, and model architecture improvements.

### Solutions in the Paper

To solve the above problems, the authors propose the following innovations:

1. **Transfer Learning and Joint Optimization**: Knowledge from pre-trained models is integrated into the existing joint training framework through transfer learning, improving the system's robustness to noise and reverberation.
2. **Dual-Input U-Net Enhancement Network (DUMENet)**: A new front-end speech enhancement module is designed. It adopts a dual-input structure, taking the FBANK features of noisy speech and clean speech as inputs and outputting a soft mask instead of directly reconstructing the clean speech signal. This approach can effectively handle non-additive noise (such as reverberation) and avoids introducing additional artifacts.
3. **Unified Model Training**: A single unified model is trained under mixed noise conditions, so that the model's generalization ability and robustness can be evaluated more accurately.

### Experimental Results

The experimental results show that the proposed TL-SEJ method improves recognition accuracy by 2.7% to 15.8% compared with the baseline under different signal-to-noise ratio conditions. Compared with conventional data augmentation, system accuracy improves by 0.7% to 5.8% under various noise conditions and by 1.7% to 2.8% in different RT60 reverberation scenarios.

In conclusion, the paper improves the robustness of anti-spoofing systems in noisy and reverberant environments by introducing transfer learning and joint optimization, providing new ideas and technical means for future research.
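
The soft-mask front end and the joint optimization summarized above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the loss form (binary cross-entropy plus a weighted mean-squared enhancement error), the `weight` value, and the toy feature shapes are all assumptions made for illustration. The summary also does not say whether the mask is applied to linear or log-mel FBANK features, so the sketch uses non-negative toy values.

```python
import numpy as np

def sigmoid(x):
    """Map real-valued scores to (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def apply_soft_mask(noisy_fbank, mask_logits):
    """Scale each time-frequency bin of the noisy features by a mask in (0, 1).

    The front end predicts a mask rather than the clean features themselves;
    mask_logits stands in for the output of an enhancement network.
    """
    mask = sigmoid(mask_logits)
    return mask * noisy_fbank

def joint_loss(enhanced, clean, cm_logit, label, weight=0.5):
    """Hypothetical joint objective: CM cross-entropy + weight * enhancement MSE."""
    enh_loss = np.mean((enhanced - clean) ** 2)
    p = np.clip(sigmoid(cm_logit), 1e-7, 1.0 - 1e-7)  # avoid log(0)
    cm_loss = -(label * np.log(p) + (1 - label) * np.log(1.0 - p))
    return cm_loss + weight * enh_loss

# Toy data: 100 frames x 80 filterbank bins (assumed sizes).
rng = np.random.default_rng(0)
clean = rng.random((100, 80))
noisy = clean + 0.3 * rng.random((100, 80))

# All-zero logits give a constant mask of 0.5, a placeholder for a trained network.
mask_logits = np.zeros_like(noisy)
enhanced = apply_soft_mask(noisy, mask_logits)

# label=1 marks one class of the binary CM decision; the convention is arbitrary here.
loss = joint_loss(enhanced, clean, cm_logit=0.0, label=1)
```

Because both terms are combined into one scalar, gradients from the countermeasure loss would also flow into the enhancement front end during training, which is the general idea behind optimizing the two modules jointly. How the paper weights or schedules the two terms is not stated in this summary.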