Prototype Division for Self-Supervised Speaker Verification

Zhenduo Zhao, Zhuo Li, Xueshuai Zhang, Wenchao Wang, Pengyuan Zhang
DOI: https://doi.org/10.1109/lsp.2024.3377593
2024-03-30
IEEE Signal Processing Letters
Abstract: Self-supervised learning has shown promising performance on speaker verification tasks, and Self-DIstillation with NO labels (DINO) is currently a widely adopted framework. DINO is an unsupervised deep clustering method, but its number of valid prototypes is far smaller than the number of speakers in practical applications and remains unchanged throughout training, leading to severe speaker confusion and performance degradation. Therefore, a strategy named prototype division (PD) is proposed to iteratively generate fine-grained prototypes in the projection space, starting from the converged model, in order to separate confused categories. New prototypes are derived from the neighborhood of the existing valid prototypes by clustering or sampling. On Vox1O, the method achieves a 31.1% relative improvement over the baseline without any auxiliary loss. Further experiments on CN-Celeb also show stable improvement, demonstrating the consistency of the proposed method.
engineering, electrical & electronic