Speaker Recognition Based on Pre-Trained Model and Deep Clustering

Liang He, Zhida Song, Shuanghong Liu, Mengqi Niu, Ying Hu, Hao Huang
DOI: https://doi.org/10.1109/icme57554.2024.10687367
2024-01-01
Abstract: In this paper, we propose a novel loss that integrates a frame-level deep clustering (DC) loss and a segment-level speaker recognition loss into a single network, without additional data requirements or exhaustive computation. The DC loss implicitly generates soft pseudo-phoneme labels for each frame-level feature, which facilitates extracting more discriminative speaker representations by suppressing phonetic content information. We study the DC loss not only on acoustic features but also on features extracted by pre-trained models such as wav2vec 2.0, HuBERT, and WavLM. Experimental results on the VoxCeleb dataset show that overall system performance with pre-trained model features is better than with acoustic features. The proposed loss is significantly effective for systems built on acoustic features and yields a marginal improvement for systems built on pre-trained model features.
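The combination of a frame-level clustering term and a segment-level speaker term can be illustrated with a minimal NumPy sketch. The abstract does not give the exact form of the DC loss, so everything below is an assumption made for illustration: the soft pseudo-labels are a softmax over negative squared distances to hypothetical centroids, the clustering term is a DEC-style KL divergence to a sharpened target, the segment embedding is a plain mean over frames, and `lam` is a hypothetical weighting factor. It is not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def soft_assignments(frames, centroids):
    # frames: (T, D), centroids: (K, D) -> soft pseudo-labels of shape (T, K).
    # Assumption: similarity is the negative squared Euclidean distance.
    d2 = ((frames[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return softmax(-d2, axis=1)

def dc_loss(q, eps=1e-12):
    # Assumption: DEC-style clustering loss, KL(p || q) averaged over frames,
    # where p is a sharpened version of q normalized by cluster frequency.
    w = q ** 2 / (q.sum(axis=0, keepdims=True) + eps)
    p = w / w.sum(axis=1, keepdims=True)
    return float((p * (np.log(p + eps) - np.log(q + eps))).sum() / q.shape[0])

def speaker_ce_loss(frames, W, b, speaker_id):
    # Segment-level term: mean-pool frames, then softmax cross-entropy
    # over speakers. W: (D, S), b: (S,).
    emb = frames.mean(axis=0)
    probs = softmax(emb @ W + b)
    return float(-np.log(probs[speaker_id] + 1e-12))

def joint_loss(frames, centroids, W, b, speaker_id, lam=0.1):
    # Hypothetical weighting: speaker loss + lam * frame-level DC loss.
    q = soft_assignments(frames, centroids)
    return speaker_ce_loss(frames, W, b, speaker_id) + lam * dc_loss(q)

# Toy usage with random data: 50 frames, 8-dim features, 4 clusters, 3 speakers.
rng = np.random.default_rng(0)
frames = rng.normal(size=(50, 8))
centroids = rng.normal(size=(4, 8))
W = rng.normal(size=(8, 3))
b = np.zeros(3)
loss = joint_loss(frames, centroids, W, b, speaker_id=1)
```

In a real system the frames would come from an encoder (acoustic features or a pre-trained model such as WavLM) and the loss would be minimized with automatic differentiation; the sketch only shows how the two terms are computed and summed.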