Abstract: Iterative self-training, or iterative pseudo-labeling (IPL), in which an improved model from the current iteration provides pseudo-labels for the next iteration, has proven to be a powerful approach to enhance the quality of speaker representations. Recent applications of IPL in unsupervised speaker recognition start with representations extracted from very elaborate self-supervised methods (e.g., DINO). However, training such strong self-supervised models is not straightforward (they require hyper-parameter tuning and may not generalize to out-of-domain data) and, moreover, may not be needed at all. To this end, we show that the simple, well-studied, and established i-vector generative model is enough to bootstrap the IPL process for unsupervised learning of speaker representations. We also systematically study the impact of other components on the IPL process, including the initial model, the encoder, augmentations, the number of clusters, and the clustering algorithm. Remarkably, we find that even with a simple and significantly weaker initial model like i-vector, IPL can still achieve speaker verification performance that rivals state-of-the-art methods.
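The IPL loop described in the abstract can be sketched on toy data as follows. This is only a minimal illustration under stated assumptions, not the authors' pipeline: plain k-means stands in for the clustering algorithm, a linear softmax classifier stands in for the neural encoder, and the function names (`kmeans`, `train_softmax`, `iterative_pseudo_labeling`) and the synthetic features are hypothetical.

```python
import numpy as np

def kmeans(x, k, n_iter=20, seed=0):
    """Plain Lloyd's k-means; returns one cluster id per row of x."""
    rng = np.random.default_rng(seed)
    centers = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(n_iter):
        dists = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = dists.argmin(1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = x[labels == j].mean(0)
    return labels

def train_softmax(x, labels, k, n_steps=200, lr=0.1):
    """Fit a linear softmax classifier on pseudo-labels (toy stand-in for the encoder)."""
    w = np.zeros((x.shape[1], k))
    onehot = np.eye(k)[labels]
    for _ in range(n_steps):
        logits = x @ w
        logits -= logits.max(1, keepdims=True)  # numerical stability
        probs = np.exp(logits)
        probs /= probs.sum(1, keepdims=True)
        w -= lr * x.T @ (probs - onehot) / len(x)
    return w

def iterative_pseudo_labeling(features, init_embeddings, k, n_rounds=3):
    """Cluster current embeddings, train on the pseudo-labels, re-embed, repeat."""
    embeddings = init_embeddings
    for r in range(n_rounds):
        pseudo = kmeans(embeddings, k, seed=r)   # pseudo-labels from clustering
        w = train_softmax(features, pseudo, k)   # train model on pseudo-labels
        embeddings = features @ w                # improved embeddings for next round
    return embeddings, pseudo

# Synthetic data (assumption): 4 "speakers", 50 utterances each, 8-dim features.
rng = np.random.default_rng(0)
true_speaker = np.repeat(np.arange(4), 50)
means = rng.normal(scale=5.0, size=(4, 8))
features = means[true_speaker] + rng.normal(size=(200, 8))
# A deliberately weak initial representation plays the role of the i-vector here.
init = features[:, :3] + rng.normal(scale=2.0, size=(200, 3))
emb, pseudo = iterative_pseudo_labeling(features, init, k=4)
```

In this sketch the number of clusters `k` and the number of rounds are free parameters, which mirrors the abstract's point that such components are worth studying separately from the initial model.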

What problem does this paper attempt to address?

The problem this paper addresses is how to use iterative pseudo-labeling (IPL) effectively to improve the quality of speaker representations in an unsupervised setting, without relying on complex self-supervised models. Specifically, the authors ask whether a simple, well-established i-vector generative model can bootstrap the IPL process and still reach speaker verification performance comparable to current state-of-the-art methods.

### Main problems

1. **Whether complex self-supervised models are required**: Training powerful self-supervised models (such as DINO) is cumbersome: it requires hyper-parameter tuning, and the resulting models may not generalize well to out-of-domain data. The authors therefore ask whether such a complex model is really necessary.
2. **Feasibility of simple models**: Can a simple model that is not heavily optimized (such as i-vector) bootstrap the IPL process and reduce the dependence on complex models?

### Solutions

The authors address these problems in the following ways:

- **Using i-vector to bootstrap IPL**: The authors show that even when a simple i-vector model is used as the initial model, IPL can still achieve speaker verification performance comparable to state-of-the-art methods.
- **Systematically analyzing influencing factors**: The authors also systematically study the influence of other components of the IPL framework, including the initial model, the encoder, data augmentation, the number of clusters, and the clustering algorithm.

### Experimental results

The experimental results show that, after multiple iterations, IPL bootstrapped from i-vectors achieves equal error rates (EER) of 1.79% and 1.14% on the VoxCeleb1-O test set, which is better than many existing methods. With the MFA-Conformer encoder in particular, the EER is lower than that of most existing methods.
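For reference, the EER quoted in the results is the operating point at which the false acceptance rate equals the false rejection rate. The sketch below shows one simple way to estimate it from trial scores; the function name `equal_error_rate` and the toy scores are illustrative assumptions, and evaluation toolkits typically interpolate between thresholds rather than picking the closest one as done here.

```python
import numpy as np

def equal_error_rate(scores, labels):
    """Estimate the EER from verification scores.

    scores: similarity score per trial (higher means "same speaker").
    labels: 1 for target (same-speaker) trials, 0 for non-target trials.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    thresholds = np.unique(scores)  # sorted candidate thresholds
    # A trial is accepted when its score is >= the threshold.
    far = np.array([(scores[~labels] >= t).mean() for t in thresholds])
    frr = np.array([(scores[labels] < t).mean() for t in thresholds])
    i = np.argmin(np.abs(far - frr))  # threshold where the two rates are closest
    return (far[i] + frr[i]) / 2

# Toy trials (assumption): two target and two non-target scores.
eer_separable = equal_error_rate([0.9, 0.8, 0.1, 0.2], [1, 1, 0, 0])
eer_overlap = equal_error_rate([0.9, 0.4, 0.6, 0.1], [1, 1, 0, 0])
```

A lower EER means fewer errors of both kinds at the balanced threshold, which is why it is the standard single-number summary for speaker verification.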
### Conclusions

This research shows that even with a weaker initial model (such as i-vector), IPL can reach performance comparable to that obtained with strong self-supervised models. It also indicates that, in unsupervised learning, choices such as the encoder and the clustering algorithm matter more than the choice of the initial model.

### Formula summary

The i-vector model decomposes the supervector as:

\[ M = m + Tw \]

where:

- \( M \) is a supervector that depends on the speaker and the session.
- \( m \) is the speaker- and session-independent supervector from the Universal Background Model (UBM).
- \( T \) is a rectangular matrix that defines the total variability space.
- \( w \) is a low-dimensional random latent vector (the i-vector) with prior distribution \( \mathcal{N}(0, I) \).
- The residual variability not captured by \( T \) is modeled by the covariance matrix \( \Sigma \).

Through these studies, the authors demonstrate the potential of simple models in unsupervised learning and provide a useful reference point for future research.
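The decomposition \( M = m + Tw \) can be illustrated with a small NumPy sketch. It is a simplification under stated assumptions: the supervector is treated as directly observed with diagonal noise covariance \( \Sigma \), so the estimate of \( w \) is the standard linear-Gaussian posterior mean, whereas a real i-vector extractor works from Baum-Welch statistics collected against the UBM. The function names and dimensions are hypothetical.

```python
import numpy as np

def sample_supervector(m, T, sigma_diag, rng):
    """Draw w ~ N(0, I) and return (M, w) with M = m + T w + residual noise."""
    w = rng.standard_normal(T.shape[1])
    noise = rng.standard_normal(m.shape[0]) * np.sqrt(sigma_diag)
    return m + T @ w + noise, w

def posterior_mean_w(M, m, T, sigma_diag):
    """Posterior mean of w given M, for prior N(0, I) and diagonal Sigma."""
    Tt_Sinv = T.T / sigma_diag                      # T^T Sigma^{-1}
    precision = np.eye(T.shape[1]) + Tt_Sinv @ T    # I + T^T Sigma^{-1} T
    return np.linalg.solve(precision, Tt_Sinv @ (M - m))

# Toy dimensions (assumption): 50-dim supervector, 3-dim total variability space.
rng = np.random.default_rng(0)
m = rng.normal(size=50)
T = rng.normal(size=(50, 3))
sigma_diag = np.full(50, 1e-6)  # very small residual variance
M, w_true = sample_supervector(m, T, sigma_diag, rng)
w_hat = posterior_mean_w(M, m, T, sigma_diag)
```

With a small residual variance the recovered `w_hat` is close to the sampled `w_true`; as \( \Sigma \) grows, the estimate shrinks toward the prior mean of zero.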

Speaker-IPL: Unsupervised Learning of Speaker Characteristics with i-Vector based Pseudo-Labels

Maximum Likelihood I-Vector Space Using PCA for Speaker Verification

VarASV: Enabling Pitch-variable Automatic Speaker Verification Via Multi-task Learning

Emotional speaker recognition based on i-vector through Atom Aligned Sparse Representation

Deep neural network based i-vector mapping for speaker verification using short utterances

Deep Discriminant Analysis for i-vector Based Robust Speaker Recognition

A Novel I-Vector Framework Using Multiple Features and PCA for Speaker Recognition in Short Speech Condition

Self-Supervised Learning with Cluster-Aware-DINO for High-Performance Robust Speaker Verification

Investigation of Using VAE for i-Vector Speaker Verification

Neural Predictive Coding using Convolutional Neural Networks towards Unsupervised Learning of Speaker Characteristics

Prototype Division for Self-Supervised Speaker Verification

Improving Deep Neural Networks Based Speaker Verification Using Unlabeled Data

An analytic study on clustering driven self-supervised speaker verification

DSARSR: Deep Stacked Auto-encoders Enhanced Robust Speaker Recognition

Self-supervised Reflective Learning through Self-distillation and Online Clustering for Speaker Representation Learning

Deep neural networks based speaker modeling at different levels of phonetic granularity

Overview of Speaker Modeling and Its Applications: From the Lens of Deep Speaker Representation Learning

End-to-end DNN Based Speaker Recognition Inspired by i-vector and PLDA

Discriminative scoring for speaker recognition based on I-vectors

Leveraging In-the-Wild Data for Effective Self-Supervised Pretraining in Speaker Recognition