Introducing Multilingual Phonetic Information to Speaker Embedding for Speaker Verification

Zhida Song, Liang He, Penghao Wang, Ying Hu, Hao Huang
DOI: https://doi.org/10.1109/icassp48485.2024.10446546
2024-01-01
Abstract: Incorporating frame-level phonetic information during speaker embedding extraction has been shown to improve the performance of speaker verification systems. However, previous studies have primarily relied on phonetic information obtained from pre-trained monolingual automatic speech recognition (ASR) models. Speaker verification datasets typically contain multiple languages, and some speakers are proficient in more than one, so the language of the enrolled utterance can differ from that of the test utterance. To address this mismatch, we employ a pre-trained multilingual ASR Conformer encoder to initialize the MFA-Conformer network for speaker verification. Experimental results on the VoxCeleb dataset demonstrate a significant performance improvement for the system that incorporates multilingual phonetic information across the VoxCeleb1-O, VoxCeleb1-E, and VoxCeleb1-H evaluation sets, as well as the VoxSRC21 validation set, which focuses on multilingual verification. The source code is released at https://github.com/zds-potato/multilingual-phonetic-sv.