Enhancing Open-Set Speaker Identification through Rapid Tuning with Speaker Reciprocal Points and Negative Sample

Zhiyong Chen, Zhiqi Ai, Xinnuo Li, Shugong Xu
2024-09-24
Abstract: This paper introduces a novel framework for open-set speaker identification in household environments, a task that plays a crucial role in facilitating seamless human-computer interaction. To address the limitations of current speaker models and classification approaches, our work integrates a pretrained WavLM frontend with a few-shot rapid-tuning neural network (NN) backend for enrollment, employing task-optimized Speaker Reciprocal Points Learning (SRPL) to enhance discrimination across multiple target speakers. Furthermore, we propose an enhanced version of SRPL (SRPL+), which incorporates negative sample learning with both speech-synthesized and real negative samples to significantly improve open-set SID accuracy. Our approach is thoroughly evaluated across various multi-language text-dependent speaker recognition datasets, demonstrating its effectiveness in achieving high usability for complex household multi-speaker recognition scenarios. The proposed system improves open-set performance by up to 27% over direct use of the efficient WavLM base+ model.
Audio and Speech Processing, Sound
What problem does this paper attempt to address?
This paper aims to make open-set speaker identification (SID) more accurate in home environments. Existing speaker identification models and classification methods struggle in complex multi-speaker household scenarios, and they perform especially poorly when they encounter unknown speakers (outliers). The paper therefore proposes a new framework to improve open-set SID performance.

### Main problems

1. **Limitations of existing models**: Current speaker models mainly focus on closed-set classification, which assumes that the test speech comes from one of a set of enrolled speakers. In the real world, and especially at home, unenrolled speakers may appear, which challenges the robustness and accuracy of the system.
2. **Improving the accuracy of open-set identification**: Handling unknown speakers requires a method that reliably separates known from unknown speakers while maintaining a high recognition rate for known speakers.
3. **Rapid adaptation with a small number of samples**: In practice, the system may have only a few samples per speaker for training or fine-tuning, so adapting the model quickly and effectively from a few samples is also a key issue.

### Solutions

To address these problems, the paper proposes a framework that combines a pre-trained audio front-end model (such as WavLM) with a neural network back-end that is rapidly fine-tuned on a small number of samples, and introduces the following key techniques:

- **Speaker Reciprocal Points Learning (SRPL)**: Optimizes the embedding representations of known speakers so that their distance from the Reciprocal Points (RPs) is maximized, which improves discrimination.
- **SRPL+**: Builds on SRPL by adding negative sample learning, using both synthetic and real negative samples to improve the model's generalization, especially when identifying unknown speakers.
- **Zero-shot Text-to-Speech (Zero-shot TTS)**: Uses zero-shot text-to-speech synthesis to generate negative samples, compensating for the shortage of real negative samples and further improving the robustness of the model.

### Summary

The main objective of the paper is to improve the accuracy and robustness of open-set speaker identification, especially in complex multi-speaker home scenarios, by introducing new learning mechanisms and negative sample augmentation. According to the abstract, the proposed system improves open-set performance by up to 27% over direct use of the WavLM base+ model.
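The reciprocal-point idea described under Solutions can be illustrated with a minimal sketch. This is not the paper's implementation. It assumes the general reciprocal points formulation, in which each enrolled speaker has one reciprocal point representing "everything that is not this speaker", an embedding is assigned to the speaker whose reciprocal point it is farthest from, and an utterance that stays close to all reciprocal points is rejected as unknown. The function names, the squared Euclidean distance, the 2-D toy embeddings, and the threshold value are all hypothetical choices made for illustration; in the framework described here the embeddings would come from a front end such as WavLM.

```python
from typing import List, Optional, Sequence


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two vectors of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def reciprocal_scores(
    embedding: Sequence[float],
    reciprocal_points: Sequence[Sequence[float]],
) -> List[float]:
    """Score for speaker k = distance from the embedding to speaker k's
    reciprocal point. A larger distance means stronger evidence for speaker k
    (assumed scoring rule, not taken from the paper)."""
    return [squared_distance(embedding, rp) for rp in reciprocal_points]


def identify(
    embedding: Sequence[float],
    reciprocal_points: Sequence[Sequence[float]],
    threshold: float,
) -> Optional[int]:
    """Return the index of the best-matching enrolled speaker, or None when
    the best score is below the (hypothetical) open-set threshold."""
    scores = reciprocal_scores(embedding, reciprocal_points)
    best = max(range(len(scores)), key=scores.__getitem__)
    if scores[best] < threshold:
        return None  # close to every reciprocal point: treat as unknown
    return best


# Toy 2-D example with two enrolled speakers (illustrative values only).
toy_reciprocal_points = [[1.0, 0.0], [-1.0, 0.0]]
toy_threshold = 5.0

speaker0_utterance = [-3.0, 0.0]  # far from reciprocal point 0
speaker1_utterance = [3.0, 0.0]   # far from reciprocal point 1
unknown_utterance = [0.0, 0.0]    # close to both reciprocal points

print(identify(speaker0_utterance, toy_reciprocal_points, toy_threshold))
print(identify(speaker1_utterance, toy_reciprocal_points, toy_threshold))
print(identify(unknown_utterance, toy_reciprocal_points, toy_threshold))
```

In a sketch like this, the threshold trades off false rejections of enrolled speakers against false acceptances of outsiders, so it would need to be tuned on held-out data. Negative samples of the kind SRPL+ uses would plausibly help at training time by giving the model explicit examples of speech that should be rejected.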