Abstract: Speaker recognition, recognizing speaker identities based on voice alone, enables important downstream applications, such as personalization and authentication. Learning speaker representations, in the context of supervised learning, heavily depends on both clean and sufficient labeled data, which is always difficult to acquire. Noisy unlabeled data, on the other hand, also provides valuable information that can be exploited using self-supervised training methods. In this work, we investigate how to pretrain speaker recognition models by leveraging dialogues between customers and smart-speaker devices. However, the supervisory information in such dialogues is inherently noisy, as multiple speakers may speak to a device in the course of the same dialogue. To address this issue, we propose an effective rejection mechanism that selectively learns from dialogues based on their acoustic homogeneity. Both reconstruction-based and contrastive-learning-based self-supervised methods are compared. Experiments demonstrate that the proposed method provides significant performance improvements, superior to earlier work. Dialogue pretraining, when combined with the rejection mechanism, yields 27.10% equal error rate (EER) reduction in speaker recognition, compared to a model without self-supervised pretraining.

What problem does this paper attempt to address?

The problem this paper attempts to solve is: **How can a speaker recognition model be pretrained effectively without a large amount of clean labeled data?** Specifically, the authors focus on using unlabeled dialogue data between users and smart-speaker devices for self-supervised learning, to overcome the dependence of traditional supervised learning on large amounts of clean labeled data.

### Problem Background

1. **The Importance of Speaker Recognition**
   - Voice-based speaker recognition enables important applications such as personalized services and identity verification.
   - Supervised learning methods rely on a large amount of clean, fully labeled data, which is very difficult to obtain.
2. **The Value of Unlabeled Data**
   - Although unlabeled dialogue data is noisy, it also contains valuable information that can be exploited through self-supervised learning methods.

### Main Challenges

- **The Noise Problem of Dialogue Data**: In real dialogues, multiple speakers may speak to the device within the same dialogue, which makes the supervisory information inaccurate.
- **The Influence of Multiple Speakers**: If the model learns from dialogues that involve multiple speakers, it may receive wrong learning signals, which degrades model performance.

### Solutions

To address these problems, the authors propose the following methods:

1. **Contrastive Learning Method**:
   - Treat customer utterances from the same dialogue as positive samples and customer utterances from different dialogues as negative samples, and train the model through contrastive learning.
2. **Rejection Mechanism**:
   - Compute the compactness of each dialogue (i.e., the similarity among all voice segments in the dialogue) and use it to select relatively homogeneous dialogues, so that multi-speaker dialogues do not interfere with model training.
   - For dialogues with low compactness, reduce the weight of their loss terms, thereby reducing the influence of noisy dialogues on the model.

### Experimental Results

- **Performance Improvement**: Experiments show that combining dialogue pretraining with the rejection mechanism significantly reduces the equal error rate (EER) of the speaker recognition task. Compared with a model without self-supervised pretraining, the EER is reduced by 27.10%.
- **Comparison with Other Methods**: This method is superior to other existing self-supervised pretraining methods and supervised learning methods based on large-scale labeled data in multiple evaluation indicators.

### Summary

This paper proposes an effective self-supervised learning method that extracts useful speaker information from unlabeled human-machine dialogue data and handles the noise caused by multi-speaker dialogues through the rejection mechanism, thereby significantly improving the performance of the speaker recognition model.
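The contrastive objective and the rejection mechanism described under Solutions can be illustrated with a small sketch. This is not the authors' implementation. It assumes that compactness is the mean pairwise cosine similarity of utterance embeddings, that the contrastive loss takes an InfoNCE-style form, and that low-compactness dialogues are down-weighted linearly below a threshold. The function names, the `threshold` and `temperature` values, and the toy 2-D embeddings are all hypothetical.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def compactness(embeddings):
    """Assumed definition: mean pairwise cosine similarity within one dialogue."""
    pairs = list(combinations(embeddings, 2))
    if not pairs:
        return 1.0  # a single-utterance dialogue is trivially homogeneous
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


def dialogue_weight(score, threshold=0.5):
    """Assumed soft rejection rule: full weight at or above the threshold,
    linearly reduced weight below it, and zero weight for non-positive scores."""
    if score >= threshold:
        return 1.0
    return max(score, 0.0) / threshold


def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style contrastive loss for one anchor utterance.
    The positive comes from the same dialogue, negatives from other dialogues."""
    pos = math.exp(cosine(anchor, positive) / temperature)
    neg = sum(math.exp(cosine(anchor, n) / temperature) for n in negatives)
    return -math.log(pos / (pos + neg))


def weighted_batch_loss(losses, weights):
    """Average per-dialogue losses, scaled by their rejection weights."""
    total = sum(weights)
    if total == 0:
        return 0.0  # every dialogue in the batch was rejected
    return sum(l * w for l, w in zip(losses, weights)) / total


# Toy embeddings: one homogeneous dialogue and one likely multi-speaker dialogue.
clean = [[1.0, 0.0], [0.9, 0.1], [1.0, 0.1]]
mixed = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.2]]
other = [[0.0, 1.0], [-0.5, 0.5]]  # utterances from different dialogues

dialogues = [clean, mixed]
weights = [dialogue_weight(compactness(d)) for d in dialogues]
losses = [info_nce(d[0], d[1], other) for d in dialogues]
batch_loss = weighted_batch_loss(losses, weights)
```

In this toy setup the homogeneous dialogue keeps its full weight, while the mixed dialogue contributes nothing to the batch loss. A hard threshold (weight 0 or 1) would be an equally plausible reading of the summary; the soft rule is used here only to show the "reduce the weight" variant.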

Self-supervised Speaker Recognition Training Using Human-Machine Dialogues

Explore the Use of Self-supervised Pre-trained Acoustic Features on Disguised Speech Detection

Preliminary Study on Self-contained UBM Construction for Speaker Recognition

Self-Supervised Learning with Cluster-Aware-DINO for High-Performance Robust Speaker Verification

Self-supervised Reflective Learning through Self-distillation and Online Clustering for Speaker Representation Learning

DialogueBERT: A Self-Supervised Learning based Dialogue Pre-training Encoder

An Efficient Self-Learning Framework For Interactive Spoken Dialog Systems

Fast Yet Effective Speech Emotion Recognition with Self-distillation

Task Oriented Dialogue as a Catalyst for Self-Supervised Automatic Speech Recognition

Self-Supervised Learning Based Domain Adaptation for Robust Speaker Verification

Self-attention Based Speaker Recognition Using Cluster-Range Loss

Weakly-Supervised Speech Pre-training: A Case Study on Target Speech Recognition

Efficient Personalized Speech Enhancement through Self-Supervised Learning

Leveraging In-the-Wild Data for Effective Self-Supervised Pretraining in Speaker Recognition

Self-Supervised Learning from Contrastive Mixtures for Personalized Speech Enhancement

Self-training Improves Pre-training for Few-shot Learning in Task-oriented Dialog Systems

Self-Supervised Speaker Verification with Mini-Batch Prediction Correction

Consistency Based Unsupervised Self-training For ASR Personalisation

Improving Audio-visual Speech Recognition Performance with Cross-modal Student-teacher Training

A Joint Speech Enhancement and Self-Supervised Representation Learning Framework for Noise-Robust Speech Recognition