Self-supervised Speaker Recognition Training Using Human-Machine Dialogues

Metehan Cekic, Ruirui Li, Zeya Chen, Yuguang Yang, Andreas Stolcke, Upamanyu Madhow
DOI: https://doi.org/10.1109/ICASSP43922.2022.9747325
2022-02-18
Abstract: Speaker recognition, recognizing speaker identities based on voice alone, enables important downstream applications, such as personalization and authentication. Learning speaker representations, in the context of supervised learning, heavily depends on both clean and sufficient labeled data, which is always difficult to acquire. Noisy unlabeled data, on the other hand, also provides valuable information that can be exploited using self-supervised training methods. In this work, we investigate how to pretrain speaker recognition models by leveraging dialogues between customers and smart-speaker devices. However, the supervisory information in such dialogues is inherently noisy, as multiple speakers may speak to a device in the course of the same dialogue. To address this issue, we propose an effective rejection mechanism that selectively learns from dialogues based on their acoustic homogeneity. Both reconstruction-based and contrastive-learning-based self-supervised methods are compared. Experiments demonstrate that the proposed method provides significant performance improvements, superior to earlier work. Dialogue pretraining when combined with the rejection mechanism yields 27.10% equal error rate (EER) reduction in speaker recognition, compared to a model without self-supervised pretraining.
Machine Learning
What problem does this paper attempt to address?
The problem that this paper attempts to solve is: **How can a speaker recognition model be pretrained effectively without a large amount of clean labeled data?** Specifically, the authors use unlabeled dialogues between users and smart-speaker devices for self-supervised learning, reducing the dependence of traditional supervised learning on large, clean labeled datasets.

### Problem Background

1. **The Importance of Speaker Recognition**
   - Voice-based speaker recognition enables important applications such as personalization and authentication.
   - Supervised learning methods rely on large amounts of clean, labeled data, which is difficult to acquire.
2. **The Value of Unlabeled Data**
   - Unlabeled dialogue data is noisy, but it still contains valuable information that self-supervised learning methods can exploit.

### Main Challenges

- **Noisy Supervision in Dialogue Data**: In real dialogues, multiple speakers may speak to the device in the course of the same dialogue, so the assumption that all utterances in a dialogue come from one speaker is not always correct.
- **The Influence of Multiple Speakers**: If the model learns from multi-speaker dialogues as though they were single-speaker, it receives wrong learning signals, which degrades performance.

### Solutions

To address these problems, the authors propose the following methods:

1. **Contrastive Learning Method**
   - Treat customer utterances from the same dialogue as positive samples and customer utterances from different dialogues as negative samples, and train the model with a contrastive objective. The abstract notes that reconstruction-based self-supervised methods are also compared.
2. **Rejection Mechanism**
   - Compute the compactness of each dialogue (i.e., the similarity among all speech segments in the dialogue) and use it to identify acoustically homogeneous dialogues, so that multi-speaker dialogues do not interfere with training.
   - For dialogues with low compactness, reduce the weight of their loss terms, which reduces the influence of noisy dialogues on the model.

### Experimental Results

- **Performance Improvement**: Experiments show that combining dialogue pretraining with the rejection mechanism significantly reduces the equal error rate (EER) of the speaker recognition task. Compared with a model without self-supervised pretraining, the EER is reduced by 27.10%.
- **Comparison with Other Methods**: The method is reported to outperform existing self-supervised pretraining methods, as well as supervised learning methods based on large-scale labeled data, on multiple evaluation metrics.

### Summary

This paper proposes a self-supervised learning method that extracts useful speaker information from unlabeled human-machine dialogue data and uses a rejection mechanism to handle the noise caused by multi-speaker dialogues, thereby significantly improving the performance of the speaker recognition model.
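
### Illustrative Sketch

The contrastive objective and the rejection mechanism described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the use of mean pairwise cosine similarity as the compactness score, the sigmoid form of the weight, and the `threshold`, `sharpness`, and `temperature` values are all hypothetical choices made for the example.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def dialogue_compactness(embeddings):
    """Mean pairwise cosine similarity among the utterances of one dialogue.

    Assumption: this is one plausible definition of "compactness";
    the paper's exact formula may differ.
    """
    z = l2_normalize(np.asarray(embeddings, dtype=float))
    n = z.shape[0]
    if n < 2:
        return 1.0  # assumption: a single utterance is trivially homogeneous
    sim = z @ z.T
    # Average over ordered pairs (i, j) with i != j.
    return float((sim.sum() - np.trace(sim)) / (n * (n - 1)))

def rejection_weight(compactness, threshold=0.5, sharpness=10.0):
    """Soft weight in (0, 1); low-compactness dialogues are down-weighted.

    Hypothetical sigmoid gate; threshold and sharpness are illustrative.
    """
    return float(1.0 / (1.0 + np.exp(-sharpness * (compactness - threshold))))

def weighted_contrastive_loss(anchors, positives, weights, temperature=0.1):
    """InfoNCE-style loss over a batch of dialogues, weighted per dialogue.

    Row i of `anchors` and row i of `positives` are two utterances from
    dialogue i; utterances from the other dialogues act as negatives.
    """
    a = l2_normalize(np.asarray(anchors, dtype=float))
    p = l2_normalize(np.asarray(positives, dtype=float))
    logits = (a @ p.T) / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    per_dialogue = -np.diag(log_prob)
    w = np.asarray(weights, dtype=float)
    return float((w * per_dialogue).sum() / (w.sum() + 1e-12))

# Toy data: one single-speaker dialogue and one two-speaker dialogue.
rng = np.random.default_rng(0)
speaker_a, speaker_b = np.eye(8)[0], np.eye(8)[1]
pure = speaker_a + 0.05 * rng.normal(size=(4, 8))
mixed = np.vstack([
    speaker_a + 0.05 * rng.normal(size=(2, 8)),
    speaker_b + 0.05 * rng.normal(size=(2, 8)),
])

c_pure, c_mixed = dialogue_compactness(pure), dialogue_compactness(mixed)
w_pure, w_mixed = rejection_weight(c_pure), rejection_weight(c_mixed)
print("compactness (pure, mixed):", c_pure, c_mixed)
print("loss weight  (pure, mixed):", w_pure, w_mixed)
```

In this toy setup the mixed dialogue receives a lower compactness score and therefore a smaller loss weight than the single-speaker dialogue. A hard rejection rule would instead set the weight to zero below the threshold; the soft gate is used here only because the summary describes reducing, rather than removing, the contribution of low-compactness dialogues.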