Phoneme-Aware Adaptation with Discrepancy Minimization and Dynamically-Classified Vector for Text-independent Speaker Verification
Jia Wang, Tianhao Lan, Jie Chen, Chengwen Luo, Chao Wu, Jianqiang Li
DOI: https://doi.org/10.1145/3503161.3548240
2022-01-01
Abstract: Recent studies show that introducing phonetic information through multi-task learning can significantly improve speaker embedding extraction. However, the benefits of such architectures usually depend heavily on the availability of a well-matched dataset, and domain or language mismatch causes a clear drop in performance. At the same time, the large amount of mismatched data and the auxiliary tasks may provide rich features that could be exploited. In this paper, we propose a phoneme-aware adaptation network with discrepancy minimization and a dynamically-classified vector for text-independent speaker verification to address these challenges. More specifically, our method first uses the maximum mean discrepancy (MMD) as part of the total loss function to reduce the mismatch between the training data of the speaker subnet and that of the phoneme subnet. We then use a dynamically-classified vector-guided softmax loss (DV-Softmax), which adaptively emphasizes different high-quality features and dynamically changes their weights, to guide the learning of discriminative speaker embeddings. Experimental results on the VoxCeleb1 dataset confirm its superiority over other state-of-the-art phoneme adaptation methods, with approximately 15% relative improvement in equal error rate (EER).
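To make the discrepancy term concrete, the following is a minimal sketch of a standard biased estimator of squared MMD between two batches of feature vectors. It is illustrative only: the abstract does not specify the kernel, its bandwidth, or how the MMD term is weighted in the total loss, so the Gaussian (RBF) kernel, the `gamma` value, the function names, and the synthetic batches below are all assumptions and not the authors' implementation.

```python
import math
import random


def rbf_kernel(x, y, gamma):
    """Gaussian (RBF) kernel between two equal-length feature vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def mean_kernel(xs, ys, gamma):
    """Average kernel value over all pairs (x, y) with x in xs and y in ys."""
    total = sum(rbf_kernel(x, y, gamma) for x in xs for y in ys)
    return total / (len(xs) * len(ys))


def mmd_squared(source, target, gamma=1.0):
    """Biased estimate of squared MMD between two batches of vectors.

    MMD^2 = E[k(s, s')] + E[k(t, t')] - 2 E[k(s, t)]
    The kernel choice (RBF) and gamma are assumptions for illustration.
    """
    return (
        mean_kernel(source, source, gamma)
        + mean_kernel(target, target, gamma)
        - 2.0 * mean_kernel(source, target, gamma)
    )


def sample_batch(rng, n, dim, mean):
    """Hypothetical stand-in for a batch of hidden features from a subnet."""
    return [[rng.gauss(mean, 1.0) for _ in range(dim)] for _ in range(n)]


rng = random.Random(0)
speaker_batch = sample_batch(rng, 50, 4, 0.0)  # e.g. speaker-subnet features
matched_batch = sample_batch(rng, 50, 4, 0.0)  # same distribution
shifted_batch = sample_batch(rng, 50, 4, 2.0)  # mismatched distribution

mmd_matched = mmd_squared(speaker_batch, matched_batch, gamma=0.5)
mmd_shifted = mmd_squared(speaker_batch, shifted_batch, gamma=0.5)
print(f"matched: {mmd_matched:.4f}, shifted: {mmd_shifted:.4f}")
```

In this toy setup the estimate is small when both batches come from the same distribution and larger when one is shifted, which is the property that lets an MMD term penalize mismatch between the two training sets when it is added to a training loss.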