Abstract: This paper introduces a novel framework for open-set speaker identification in household environments, playing a crucial role in facilitating seamless human-computer interactions. Addressing the limitations of current speaker models and classification approaches, our work integrates a pretrained WavLM frontend with a few-shot rapid tuning neural network (NN) backend for enrollment, employing task-optimized Speaker Reciprocal Points Learning (SRPL) to enhance discrimination across multiple target speakers. Furthermore, we propose an enhanced version of SRPL (SRPL+), which incorporates negative sample learning with both speech-synthesized and real negative samples to significantly improve open-set SID accuracy. Our approach is thoroughly evaluated across various multi-language text-dependent speaker recognition datasets, demonstrating its effectiveness in achieving high usability for complex household multi-speaker recognition scenarios. The proposed system enhanced open-set performance by up to 27% over the direct use of the efficient WavLM Base+ model.

What problem does this paper attempt to address?

This paper aims to make open-set speaker identification (SID) more accurate in the home environment. Existing speaker identification models and classification methods have limitations in complex multi-speaker home scenarios, and they perform poorly when faced with unknown speakers (outliers). The paper therefore proposes a new framework to improve open-set speaker identification.

### Main problems

1. **Limitations of existing models**: Current speaker models mainly focus on closed-set classification, which assumes that the test speech comes from one of a set of registered speakers. In the real world, and especially in the home, unregistered speakers may be encountered, which challenges the robustness and accuracy of the system.
2. **Improving the accuracy of open-set identification**: Handling unknown speakers requires a method that effectively distinguishes known from unknown speakers while maintaining a high recognition rate for known speakers.
3. **Rapid adaptation with a small number of samples**: In practice, the system may have only a few samples per speaker for training or fine-tuning, so fast and effective model adjustment from few samples is also a key issue.

### Solutions

To address these problems, the paper proposes a framework that combines a pretrained audio front-end model (such as WavLM) with a neural network back-end that is rapidly fine-tuned on a small number of samples. It introduces the following key techniques:

- **Speaker Reciprocal Points Learning (SRPL)**: Optimizes the embedding representations of known speakers so that their distance from the Reciprocal Points (RPs) is maximized, which improves discrimination.
- **SRPL+**: Builds on SRPL by adding negative sample learning, using both synthetic and real negative samples to improve the model's generalization, especially for identifying unknown speakers.
- **Zero-shot Text-to-Speech (Zero-shot TTS)**: Uses zero-shot text-to-speech synthesis to generate negative samples, compensating for the shortage of real negative samples and further improving robustness.

### Summary

The main objective of this paper is to significantly improve the accuracy and robustness of open-set speaker identification, especially in complex multi-speaker home scenarios, by introducing new learning mechanisms and negative sample enhancement techniques. Experimental results show that the proposed SRPL and SRPL+ methods improve on existing methods; the abstract reports an open-set gain of up to 27% over direct use of the WavLM Base+ model.
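The reciprocal-point idea listed under Solutions can be illustrated with a small sketch. It is not the paper's implementation: it assumes that each enrolled speaker already has one learned reciprocal point, that the score for a speaker is the Euclidean distance from that speaker's reciprocal point, and that a sample is rejected as unknown when the highest softmax probability falls below a threshold. The function names, the toy 2-D embeddings, and the threshold value are all hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def open_set_predict(embedding, reciprocal_points, threshold):
    """Return (speaker_index, confidence); speaker_index is None for 'unknown'.

    reciprocal_points[k] is the (assumed, already learned) reciprocal point of
    enrolled speaker k, i.e. a point standing for "everything that is not k".
    """
    # Score for speaker k = distance from k's reciprocal point:
    # the farther a sample is from "not k", the more it looks like k.
    logits = [euclidean(embedding, rp) for rp in reciprocal_points]

    # Numerically stable softmax over the enrolled speakers.
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    best = max(range(len(probs)), key=probs.__getitem__)
    # Assumed rejection rule: low top probability means "unknown speaker".
    if probs[best] < threshold:
        return None, probs[best]
    return best, probs[best]


# Toy setup: speaker 0 lives near (-4, 0), so its reciprocal point is at (4, 0),
# and vice versa for speaker 1.
reciprocal_points = [[4.0, 0.0], [-4.0, 0.0]]

known, _ = open_set_predict([-4.0, 0.5], reciprocal_points, threshold=0.9)
unknown, _ = open_set_predict([0.0, 0.2], reciprocal_points, threshold=0.9)
print(known, unknown)
```

In this toy setup the first sample is far from speaker 0's reciprocal point and close to speaker 1's, so it is assigned to speaker 0. The second sample is equally distant from both reciprocal points, so no speaker reaches the threshold and it is rejected. A real system would learn the reciprocal points during enrollment tuning and calibrate the threshold on held-out data.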

Enhancing Open-Set Speaker Identification through Rapid Tuning with Speaker Reciprocal Points and Negative Sample
