Multi-View Speaker Embedding Learning for Enhanced Stability and Discriminability.

Liang He, Zhihua Fang, Zuoer Chen, Minqiang Xu, Ying Meng, Penghao Wang
DOI: https://doi.org/10.1109/ICASSP48485.2024.10448494
2024-01-01
Abstract: Deep neural network models based on x-vectors have become the most popular framework for speaker recognition, and the quality of the speaker features (embeddings) is important for open-set tasks such as speaker verification and speaker diarization. Currently, the most popular loss functions are based on a margin penalty; however, they only enlarge the inter-class distance and neglect to reduce intra-class feature differences. Therefore, we propose a multi-view learning approach that divides the training process into two views at the speaker embedding level. The classification view focuses on discriminating between different speakers, while the clustering view focuses on shrinking the feature boundaries of the same speaker, making intra-class differences smaller. The combined effect of the two views achieves large inter-class distances and small intra-class distances, resulting in more discriminative and stable speaker embeddings. We test the method on both speaker verification and speaker diarization tasks, and the results demonstrate the effectiveness of our approach.
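The abstract does not give the loss formulas, so the following is only a minimal sketch of how a two-view objective of this kind could be combined. It assumes an additive-margin softmax as a stand-in for the classification view and a centroid-distance penalty as a stand-in for the clustering view. The function names, the `margin`, `scale`, and `lam` values, and the choice of both losses are illustrative assumptions, not the authors' method.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    # Scale each vector to unit length; eps guards against division by zero.
    return x / np.maximum(np.linalg.norm(x, axis=axis, keepdims=True), eps)

def classification_view_loss(emb, labels, weights, margin=0.2, scale=30.0):
    # Assumed stand-in: additive-margin softmax on cosine similarities.
    # The margin is subtracted from the target-class cosine only, which
    # pushes embeddings of different speakers apart.
    cos = l2_normalize(emb) @ l2_normalize(weights).T      # shape (N, C)
    onehot = np.eye(weights.shape[0])[labels]              # shape (N, C)
    logits = scale * (cos - margin * onehot)
    logits = logits - logits.max(axis=1, keepdims=True)    # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-(log_prob * onehot).sum(axis=1).mean())

def clustering_view_loss(emb, labels):
    # Assumed stand-in: mean squared distance between each normalized
    # embedding and the batch centroid of its own speaker, which pulls
    # embeddings of the same speaker together.
    emb = l2_normalize(emb)
    total = 0.0
    for spk in np.unique(labels):
        members = emb[labels == spk]
        centroid = members.mean(axis=0)
        total += ((members - centroid) ** 2).sum()
    return float(total / len(emb))

def multi_view_loss(emb, labels, weights, lam=0.5):
    # Hypothetical weighting: lam balances the two views.
    return (classification_view_loss(emb, labels, weights)
            + lam * clustering_view_loss(emb, labels))

# Toy batch: 8 embeddings of dimension 16 from 4 speakers.
rng = np.random.default_rng(0)
emb = rng.normal(size=(8, 16))
labels = np.array([0, 0, 1, 1, 2, 2, 3, 3])
weights = rng.normal(size=(4, 16))  # one class vector per speaker
print(multi_view_loss(emb, labels, weights))
```

In this sketch the clustering term is zero when all embeddings of a speaker coincide, so it only penalizes intra-class spread, while the margin term handles inter-class separation.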