Post-Training Embedding Alignment for Decoupling Enrollment and Runtime Speaker Recognition Models

Chenyang Gao, Brecht Desplanques, Chelsea J.-T. Ju, Aman Chadha, Andreas Stolcke
2024-01-23
Abstract: Automated speaker identification (SID) is a crucial step for the personalization of a wide range of speech-enabled services. Typical SID systems use a symmetric enrollment-verification framework with a single model to derive embeddings both offline for voice profiles extracted from enrollment utterances, and online from runtime utterances. Due to the distinct circumstances of enrollment and runtime, such as different computation and latency constraints, several applications would benefit from an asymmetric enrollment-verification framework that uses different models for enrollment and runtime embedding generation. To support this asymmetric SID where each of the two models can be updated independently, we propose using a lightweight neural network to map the embeddings from the two independent models to a shared speaker embedding space. Our results show that this approach significantly outperforms cosine scoring in a shared speaker logit space for models that were trained with a contrastive loss on large datasets with many speaker identities. This proposed Neural Embedding Speaker Space Alignment (NESSA) combined with an asymmetric update of only one of the models delivers at least 60% of the performance gain achieved by updating both models in the standard symmetric SID approach.
Audio and Speech Processing, Machine Learning, Sound
What problem does this paper attempt to address?
This paper addresses model mismatch in asymmetric speaker identification (SID), where enrollment and runtime embeddings come from different models, by aligning the two embedding spaces after the speaker models have been trained (post-training embedding alignment). Typical SID systems use a symmetric enrollment-verification framework: a single model derives embeddings both offline, for voice profiles extracted from enrollment utterances, and online, from runtime utterances. Because enrollment and runtime operate under different conditions, such as different computation and latency constraints, the paper argues for an asymmetric framework that uses different models for enrollment and runtime embedding generation, with each model updatable independently of the other.
To bridge the mismatch between the two models, the paper proposes Neural Embedding Speaker Space Alignment (NESSA), which uses a lightweight neural network to map the embeddings produced by the two independent models into a shared speaker embedding space. Experimental results show that this approach significantly outperforms cosine scoring in a shared speaker logit space for models trained with a contrastive loss on large datasets with many speaker identities. Combined with an asymmetric update of only one of the models, NESSA delivers at least 60% of the performance gain achieved by updating both models in the standard symmetric SID approach.
The paper also points out that previous work mainly relies on alignment in a shared speaker logit space, which may be ineffective for models trained with non-typical training objectives. In contrast, NESSA allows the models to be updated independently, without joint training, and improves performance when only one of the two models is updated. The experimental section evaluates NESSA in different scenarios, including comparisons with baseline models, and shows its advantages in the asymmetric framework.
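The scoring path described above (embeddings from two separate models, mapped into one shared space and compared there) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the `AlignmentHead` class, the use of one small head per model, the single tanh hidden layer, the dimensions, and the random placeholder weights, which a real system would learn after the speaker models are trained.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for degenerate zero vectors
    return dot / (norm_u * norm_v)


class AlignmentHead:
    """Hypothetical lightweight MLP that maps one model's embeddings
    into the shared speaker space (architecture is an assumption)."""

    def __init__(self, in_dim, hidden_dim, out_dim, rng):
        # Random placeholder weights; in practice these would be trained.
        self.w1 = [[rng.gauss(0, 1) for _ in range(in_dim)]
                   for _ in range(hidden_dim)]
        self.w2 = [[rng.gauss(0, 1) for _ in range(hidden_dim)]
                   for _ in range(out_dim)]

    def __call__(self, x):
        # Hidden layer with tanh nonlinearity, then a linear output layer.
        hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x)))
                  for row in self.w1]
        return [sum(w * h for w, h in zip(row, hidden)) for row in self.w2]


rng = random.Random(0)

# The two speaker models may emit embeddings of different sizes (8 vs. 6 here).
enroll_head = AlignmentHead(in_dim=8, hidden_dim=16, out_dim=4, rng=rng)
runtime_head = AlignmentHead(in_dim=6, hidden_dim=16, out_dim=4, rng=rng)

# Stand-ins for a stored voice-profile embedding and a runtime embedding.
enroll_embedding = [rng.gauss(0, 1) for _ in range(8)]
runtime_embedding = [rng.gauss(0, 1) for _ in range(6)]

# Score in the shared space; a higher value would suggest the same speaker.
score = cosine(enroll_head(enroll_embedding), runtime_head(runtime_embedding))
print(f"asymmetric score: {score:.3f}")
```

Under this arrangement, replacing one speaker model would only require refitting the alignment for the new embedding space, while voice profiles produced by the other model stay as they are. With untrained weights the printed score carries no meaning; the sketch only shows the data flow.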
Furthermore, the paper explores how the number of training speakers and an enhanced version of contrastive learning (M3) affect NESSA, showing that in specific cases the method can come close to or exceed the performance of a standard symmetric framework that updates both models. In summary, this paper aims to address the model mismatch issue in asymmetric speaker recognition frameworks with the NESSA method, which enables effective and independent model updates and improves system efficiency and performance.
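The comparison against the symmetric framework can be made concrete with a small formula. The symbols below are hypothetical and introduced only for illustration: $E_{\text{old}}$ is an error metric (lower is better) before any model update, $E_{\text{sym}}$ is the same metric after updating both models in the symmetric setup, and $E_{\text{asym}}$ is the metric after updating only one model and applying NESSA. The summary above does not say which metric the paper reports, so this is a sketch of one natural reading, assuming $E_{\text{sym}} < E_{\text{old}}$.

```latex
\[
  \rho \;=\; \frac{E_{\text{old}} - E_{\text{asym}}}{E_{\text{old}} - E_{\text{sym}}}
\]
```

On this reading, "at least 60% of the performance gain" corresponds to $\rho \ge 0.6$, and matching or exceeding the symmetric framework corresponds to $\rho \ge 1$.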