Towards single integrated spoofing-aware speaker verification embeddings

Sung Hwan Mun, Hye-jin Shim, Hemlata Tak, Xin Wang, Xuechen Liu, Md Sahidullah, Myeonghun Jeong, Min Hyun Han, Massimiliano Todisco, Kong Aik Lee, Junichi Yamagishi, Nicholas Evans, Tomi Kinnunen, Nam Soo Kim, Jee-weon Jung
2023-06-01
Abstract: This study aims to develop single integrated spoofing-aware speaker verification (SASV) embeddings that satisfy two requirements. First, they should reject non-target speakers' inputs as well as spoofed inputs of target speakers. Second, they should demonstrate competitive performance compared to the fusion of automatic speaker verification (ASV) and countermeasure (CM) embeddings, which outperformed single-embedding solutions by a large margin in the SASV2022 challenge. We attribute the inferior performance of single SASV embeddings to an insufficient amount of training data and the distinct nature of the ASV and CM tasks. To this end, we propose a novel framework that includes multi-stage training and a combination of loss functions. Copy synthesis, combined with several vocoders, is also exploited to address the lack of spoofed data. Experimental results show dramatic improvements, achieving a SASV-EER of 1.06% on the evaluation protocol of the SASV2022 challenge.
Audio and Speech Processing, Artificial Intelligence, Sound
What problem does this paper attempt to address?
The paper aims to develop a single integrated spoofing-aware speaker verification (SASV) embedding that meets two requirements at once:

1. **Reject non-target speaker inputs and spoofed inputs of target speakers**: The model must reject speech from non-target speakers and must also reject spoofed speech that imitates a target speaker.
2. **Competitive performance compared with fusion methods**: Fusing automatic speaker verification (ASV) and countermeasure (CM) embeddings outperformed single-embedding solutions by a large margin in the SASV2022 challenge. The goal is for the single integrated SASV embedding to be competitive with these fusion methods.

The authors attribute the poor performance of single SASV embeddings mainly to two factors: an insufficient amount of training data and the distinct nature of the ASV and CM tasks. To address this, they propose a new framework that includes multi-stage training and a combination of loss functions. They also use copy synthesis, combined with several vocoders, to compensate for the lack of spoofed data. The experimental results show a dramatic improvement, with the method reaching a SASV-EER (equal error rate) of 1.06% on the evaluation protocol of the SASV2022 challenge.
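
To make the SASV-EER figure concrete, the following is a minimal sketch of how such a metric can be computed, assuming target trials are the only positives and both non-target and spoofed trials count as negatives. The function names and the toy scores are hypothetical, and the simple threshold sweep is an approximation for illustration, not the challenge's official scoring code.

```python
def equal_error_rate(positive_scores, negative_scores):
    """Approximate the EER by sweeping thresholds over the observed scores.

    A trial is accepted when its score is >= the threshold.
    """
    thresholds = sorted(set(positive_scores) | set(negative_scores))
    best_gap = float("inf")
    eer = None
    for t in thresholds:
        # False acceptance rate: negatives that would be accepted.
        far = sum(s >= t for s in negative_scores) / len(negative_scores)
        # False rejection rate: positives that would be rejected.
        frr = sum(s < t for s in positive_scores) / len(positive_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            # Keep the operating point where FAR and FRR are closest.
            best_gap = gap
            eer = (far + frr) / 2
    return eer


def sasv_eer(target_scores, nontarget_scores, spoof_scores):
    """EER with non-target and spoofed trials pooled as negatives (assumption)."""
    negatives = list(nontarget_scores) + list(spoof_scores)
    return equal_error_rate(list(target_scores), negatives)


if __name__ == "__main__":
    # Toy scores, made up purely for illustration.
    target = [0.9, 0.8, 0.7, 0.6]
    nontarget = [0.1, 0.2]
    spoof = [0.3, 0.65]
    print(f"SASV-EER: {sasv_eer(target, nontarget, spoof):.2%}")
```

Pooling the two negative classes is what makes the metric sensitive to both failure modes: a system that rejects impostors well but accepts spoofed speech still scores poorly.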