Speaker-IPL: Unsupervised Learning of Speaker Characteristics with i-Vector based Pseudo-Labels

Zakaria Aldeneh, Takuya Higuchi, Jee-weon Jung, Li-Wei Chen, Stephen Shum, Ahmed Hussen Abdelaziz, Shinji Watanabe, Tatiana Likhomanenko, Barry-John Theobald
2024-09-17
Abstract: Iterative self-training, or iterative pseudo-labeling (IPL)--using an improved model from the current iteration to provide pseudo-labels for the next iteration--has proven to be a powerful approach to enhance the quality of speaker representations. Recent applications of IPL in unsupervised speaker recognition start with representations extracted from very elaborate self-supervised methods (e.g., DINO). However, training such strong self-supervised models is not straightforward (they require hyper-parameter tuning and may not generalize to out-of-domain data) and, moreover, may not be needed at all. To this end, we show that the simple, well-studied, and established i-vector generative model is enough to bootstrap the IPL process for unsupervised learning of speaker representations. We also systematically study the impact of other components on the IPL process, including the initial model, the encoder, augmentations, the number of clusters, and the clustering algorithm. Remarkably, we find that even with a simple and significantly weaker initial model like i-vector, IPL can still achieve speaker verification performance that rivals state-of-the-art methods.
Audio and Speech Processing, Sound
What problem does this paper attempt to address?
The paper asks how iterative pseudo-labeling (IPL) can improve the quality of speaker representations in an unsupervised setting without relying on complex self-supervised models. Specifically, the authors test whether the simple, well-established i-vector generative model can bootstrap the IPL process and still reach speaker verification performance comparable to current state-of-the-art methods.

### Main problems

1. **Whether complex self-supervised models are required**: Training powerful self-supervised models (such as DINO) is cumbersome, requires tuning multiple hyper-parameters, and may not generalize well to out-of-domain data. The authors therefore ask whether such a complex model is really necessary.
2. **Feasibility of simple models**: Can a simple model without elaborate optimization (such as i-vector) bootstrap the IPL process and reduce the dependence on complex models?

### Solutions

The authors address these problems in two ways:

- **Using i-vector to bootstrap IPL**: The authors show that even with a simple i-vector model as the initial model, IPL can still achieve speaker verification performance comparable to state-of-the-art methods.
- **Systematically analyzing influencing factors**: The authors also systematically study the influence of other components of the IPL framework, including the initial model, the encoder, data augmentation, the number of clusters, and the clustering algorithm.

### Experimental results

After multiple iterations, IPL bootstrapped from i-vectors achieves equal error rates (EER) of 1.79% and 1.14% on the VoxCeleb1-O test set, which is better than many existing methods. With the MFA-Conformer encoder in particular, the EER is lower than that of most existing methods.
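
To make the IPL loop summarized above concrete, the following is a minimal sketch, not the authors' implementation. It assumes toy NumPy features in place of audio, a hand-written k-means for the clustering step, and a least-squares linear map standing in for the neural encoder; the function names `kmeans`, `train_encoder`, and `iterative_pseudo_labeling` are hypothetical, and augmentations are omitted.

```python
import numpy as np


def kmeans(x, k, rng, n_iter=20):
    """Plain k-means; returns one cluster id (pseudo-label) per row of x."""
    centers = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(n_iter):
        # Squared Euclidean distance from every point to every center.
        dists = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = dists.argmin(axis=1)
        for c in range(k):
            members = x[labels == c]
            if len(members) > 0:  # keep the old center if a cluster empties
                centers[c] = members.mean(axis=0)
    return labels


def train_encoder(features, pseudo_labels, k):
    """Toy stand-in for the encoder: a linear map fitted by least squares
    to one-hot pseudo-labels (the paper trains a neural network instead)."""
    targets = np.eye(k)[pseudo_labels]
    weights, *_ = np.linalg.lstsq(features, targets, rcond=None)
    return weights


def iterative_pseudo_labeling(features, initial_embeddings, k, n_rounds=3, seed=0):
    """Cluster the current embeddings, train on the resulting pseudo-labels,
    re-extract embeddings with the new model, and repeat."""
    rng = np.random.default_rng(seed)
    embeddings = initial_embeddings  # e.g. i-vectors in the paper's setting
    for _ in range(n_rounds):
        pseudo_labels = kmeans(embeddings, k, rng)
        weights = train_encoder(features, pseudo_labels, k)
        embeddings = features @ weights
    return embeddings, pseudo_labels
```

A call might look like `iterative_pseudo_labeling(features, initial_embeddings, k=5)`, where `initial_embeddings` plays the role of the weak initial model's output. The point of the sketch is the control flow: only the first round depends on the initial model, and every later round clusters embeddings produced by the model trained in the previous round.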

### Conclusions

This research shows that even with a weaker initial model (such as i-vector), IPL can reach performance comparable to that obtained with strong self-supervised initial models. It also indicates that, in unsupervised learning, the choice of other components such as the encoder and the clustering algorithm matters more than the choice of the initial model.

### Formula summary

The i-vector model decomposes the supervector as:

\[ M = m + Tw \]

where:

- \( M \) is a supervector that depends on the speaker and the session.
- \( m \) is a speaker- and session-independent supervector taken from the Universal Background Model (UBM).
- \( T \) is a rectangular matrix that defines the total variability space.
- \( w \) is a low-dimensional latent random vector (the i-vector) with prior distribution \( N(0, I) \).
- The variability not captured by \( T \) is captured by the covariance matrix \( \Sigma \) of the model.

Through these studies, the authors demonstrate the potential of simple models in unsupervised learning and provide a useful reference point for future research.
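
The decomposition \( M = m + Tw \) above can be illustrated with a small numerical sketch. It assumes arbitrary toy dimensions and random values for \( m \) and \( T \), and it recovers \( w \) by ordinary least squares from a noise-free supervector. That recovery step is a simplification for illustration only: an actual i-vector extractor estimates a posterior over \( w \) from per-utterance statistics and \( \Sigma \). The helper names `sample_supervector` and `recover_factor` are hypothetical.

```python
import numpy as np


def sample_supervector(m, T, rng):
    """Draw w ~ N(0, I) and return the supervector M = m + T w together with w."""
    w = rng.standard_normal(T.shape[1])
    return m + T @ w, w


def recover_factor(M, m, T):
    """Least-squares estimate of w from a noise-free supervector.

    Illustration only: this ignores Sigma and the posterior computation
    that a real i-vector extractor performs.
    """
    w_hat, *_ = np.linalg.lstsq(T, M - m, rcond=None)
    return w_hat


rng = np.random.default_rng(0)
supervector_dim, ivector_dim = 60, 5  # toy sizes, chosen arbitrarily
m = rng.normal(size=supervector_dim)                 # stands in for the UBM mean supervector
T = rng.normal(size=(supervector_dim, ivector_dim))  # stands in for the total variability matrix
M, w = sample_supervector(m, T, rng)
w_hat = recover_factor(M, m, T)
```

Because \( T \) is tall and the supervector here contains no residual noise, the least-squares estimate `w_hat` matches the sampled `w` up to numerical precision, which shows how a low-dimensional vector summarizes a much larger supervector.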