Abstract: Automated speaker identification (SID) is a crucial step for the personalization of a wide range of speech-enabled services. Typical SID systems use a symmetric enrollment-verification framework with a single model to derive embeddings both offline for voice profiles extracted from enrollment utterances, and online from runtime utterances. Due to the distinct circumstances of enrollment and runtime, such as different computation and latency constraints, several applications would benefit from an asymmetric enrollment-verification framework that uses different models for enrollment and runtime embedding generation. To support this asymmetric SID where each of the two models can be updated independently, we propose using a lightweight neural network to map the embeddings from the two independent models to a shared speaker embedding space. Our results show that this approach significantly outperforms cosine scoring in a shared speaker logit space for models that were trained with a contrastive loss on large datasets with many speaker identities. This proposed Neural Embedding Speaker Space Alignment (NESSA) combined with an asymmetric update of only one of the models delivers at least 60% of the performance gain achieved by updating both models in the standard symmetric SID approach.

What problem does this paper attempt to address?

This paper addresses asymmetric speaker identification (SID), in which different models generate the enrollment and runtime embeddings and each model can be updated independently, by means of post-training embedding alignment. Traditional SID systems use a single model to derive embeddings both offline, for voice profiles extracted from enrollment utterances, and online, from runtime utterances. Because enrollment and runtime operate under different conditions, such as different computation and latency constraints, the paper proposes an asymmetric framework that uses different models for enrollment and runtime embedding generation. To address the mismatch between the two models, it proposes Neural Embedding Speaker Space Alignment (NESSA), which uses a lightweight neural network to map the embeddings generated by the two independent models to a shared speaker embedding space. Experimental results show that this method significantly outperforms cosine scoring in a shared speaker logit space for models trained with a contrastive loss on large datasets with many speaker identities. Combined with an asymmetric update of only one of the models, NESSA delivers at least 60% of the performance gain achieved by updating both models in the standard symmetric SID approach. The paper also points out that previous work mainly relies on alignment in a shared speaker logit space, which may be ineffective for models trained with atypical training objectives. In contrast, NESSA allows each model to be updated independently, without joint training of the two models, and improves performance without large-scale updates to the models. The experimental section evaluates NESSA in several scenarios, including comparisons with baseline models, to demonstrate its advantages in the asymmetric framework.
Furthermore, the paper explores how the number of training speakers and an enhanced version of contrastive learning (M3) affect NESSA, showing that in specific cases the asymmetric approach can match or exceed the performance of a standard symmetric framework that updates both models. In summary, the paper addresses the model mismatch in asymmetric SID frameworks with NESSA, which enables effective and independent model updates and improves system efficiency and performance.
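
To make the scoring path concrete, the following is a minimal sketch of the idea described above: each model's embedding passes through its own small network into a shared space, and the two mapped vectors are compared by cosine similarity. It is not the paper's implementation. The function names, the one-hidden-layer ReLU architecture, the embedding dimensions, and the random untrained weights are all assumptions chosen for illustration; in practice the mappers would be trained on speaker-labeled data.

```python
import math
import random


def init_layer(rng, in_dim, out_dim):
    """Random weights and zero biases for one dense layer (untrained, illustrative)."""
    scale = 1.0 / math.sqrt(in_dim)
    weights = [[rng.gauss(0.0, scale) for _ in range(in_dim)] for _ in range(out_dim)]
    return weights, [0.0] * out_dim


def linear(layer, x):
    """Apply a dense layer: weights @ x + bias."""
    weights, bias = layer
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def map_to_shared(mapper, emb):
    """Project an embedding into the shared space and L2-normalize it."""
    hidden = [max(h, 0.0) for h in linear(mapper[0], emb)]  # ReLU
    out = linear(mapper[1], hidden)
    norm = math.sqrt(sum(v * v for v in out)) or 1.0  # guard against a zero vector
    return [v / norm for v in out]


def cosine_score(enroll_mapper, runtime_mapper, enroll_emb, runtime_emb):
    """Cosine similarity of the two embeddings after mapping to the shared space."""
    e = map_to_shared(enroll_mapper, enroll_emb)
    r = map_to_shared(runtime_mapper, runtime_emb)
    return sum(a * b for a, b in zip(e, r))


rng = random.Random(0)
# Hypothetical sizes: the enrollment and runtime models need not share a dimension.
ENROLL_DIM, RUNTIME_DIM, HIDDEN_DIM, SHARED_DIM = 32, 24, 20, 16
enroll_mapper = (init_layer(rng, ENROLL_DIM, HIDDEN_DIM), init_layer(rng, HIDDEN_DIM, SHARED_DIM))
runtime_mapper = (init_layer(rng, RUNTIME_DIM, HIDDEN_DIM), init_layer(rng, HIDDEN_DIM, SHARED_DIM))

# Stand-ins for a stored voice profile and a runtime utterance embedding.
profile = [rng.gauss(0.0, 1.0) for _ in range(ENROLL_DIM)]
utterance = [rng.gauss(0.0, 1.0) for _ in range(RUNTIME_DIM)]
score = cosine_score(enroll_mapper, runtime_mapper, profile, utterance)
```

Because each model has its own mapper, replacing one model in this sketch only requires a new mapper for that side; the other model and any embeddings already stored for it stay as they are.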

Post-Training Embedding Alignment for Decoupling Enrollment and Runtime Speaker Recognition Models

Joint speaker encoder and neural back-end model for fully end-to-end automatic speaker verification with multiple enrollment utterances

Adapting End-to-End Neural Speaker Verification to New Languages and Recording Conditions with Adversarial Training

Improving Speaker Identification for Shared Devices by Adapting Embeddings to Speaker Subsets

ECAPA2: A Hybrid Neural Network Architecture and Training Strategy for Robust Speaker Embeddings

Powerful Speaker Embedding Training Framework by Adversarially Disentangled Identity Representation

A Speaker-Dependent Approach to Single-Channel Joint Speech Separation and Acoustic Modeling Based on Deep Neural Networks for Robust Recognition of Multi-Talker Speech

Multi-View Speaker Embedding Learning for Enhanced Stability and Discriminability

Neural Scoring, Not Embedding: A Novel Framework for Robust Speaker Verification

A Speaker-Dependent Deep Learning Approach to Joint Speech Separation and Acoustic Modeling for Multi-Talker Automatic Speech Recognition

DSARSR: Deep Stacked Auto-encoders Enhanced Robust Speaker Recognition

Adapting Self-Supervised Models to Multi-Talker Speech Recognition Using Speaker Embeddings

Two Methods for Spoofing-Aware Speaker Verification: Multi-Layer Perceptron Score Fusion Model and Integrated Embedding Projector

Deep Speaker Embedding Learning with Multi-level Pooling for Text-independent Speaker Verification

Cross-Speaker Encoding Network for Multi-Talker Speech Recognition

Analyzing And Improving Neural Speaker Embeddings for ASR

EEND-SS: Joint End-to-End Neural Speaker Diarization and Speech Separation for Flexible Number of Speakers

Deep Speaker: an End-to-End Neural Speaker Embedding System

Zero-Shot Multi-Speaker Text-To-Speech with State-Of-The-Art Neural Speaker Embeddings

Towards Robust Speaker Verification with Target Speaker Enhancement

Adversarial Speaker Verification