Abstract: Current mainstream speaker verification systems are predominantly based on the concept of "speaker embedding", which transforms variable-length speech signals into fixed-length speaker vectors; verification is then performed using the cosine similarity between the embeddings of the enrollment and test utterances. However, this approach suffers considerable performance degradation in the presence of severe noise and interfering speakers. This paper introduces Neural Scoring, a novel framework that recasts speaker verification as a scoring task using a Transformer-based architecture. The proposed method first extracts an embedding from the enrollment speech and frame-level features from the test speech. A Transformer network then generates a decision score that quantifies the likelihood that the enrolled speaker is present in the test speech. We evaluated Neural Scoring on the VoxCeleb dataset across five test scenarios, comparing it with the state-of-the-art embedding-based approach. While Neural Scoring achieves performance comparable to the state of the art under the benchmark (clean) test condition, it demonstrates a remarkable advantage in the four complex scenarios, achieving an overall 64.53% reduction in equal error rate (EER) compared to the baseline.
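
For readers unfamiliar with the conventional pipeline the abstract contrasts against, the following is a minimal sketch of cosine scoring between two fixed-length embeddings and of an approximate EER computed from trial scores. It illustrates the embedding-based baseline only, not the Neural Scoring model itself. The function names `cosine_score` and `equal_error_rate` and the toy scores are hypothetical, and the threshold sweep is a simple approximation rather than the evaluation code used in the paper.

```python
import math
from typing import Sequence


def cosine_score(enroll: Sequence[float], test: Sequence[float]) -> float:
    """Cosine similarity between two fixed-length speaker embeddings."""
    dot = sum(a * b for a, b in zip(enroll, test))
    norm = math.sqrt(sum(a * a for a in enroll)) * math.sqrt(sum(b * b for b in test))
    return dot / norm


def equal_error_rate(target_scores: Sequence[float],
                     nontarget_scores: Sequence[float]) -> float:
    """Approximate EER (hypothetical helper, not from the paper).

    Sweeps every observed score as a decision threshold and returns the
    mean of the false-acceptance and false-rejection rates at the
    threshold where the two rates are closest.
    """
    best_gap, eer = float("inf"), 1.0
    for thr in sorted(set(target_scores) | set(nontarget_scores)):
        # A trial is accepted when its score is at or above the threshold.
        frr = sum(s < thr for s in target_scores) / len(target_scores)
        far = sum(s >= thr for s in nontarget_scores) / len(nontarget_scores)
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer


if __name__ == "__main__":
    # Toy scores, invented for illustration only.
    targets = [0.9, 0.8, 0.7, 0.3]
    nontargets = [0.6, 0.4, 0.2, 0.1]
    print(equal_error_rate(targets, nontargets))
```

In this baseline the enrollment and test utterances are each reduced to a single vector before comparison. By the abstract's description, Neural Scoring instead keeps frame-level features for the test speech and lets a Transformer produce the decision score.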

Challenging margin-based speaker embedding extractors by using the variational information bottleneck

Large Margin Softmax Loss for Speaker Verification

Angular Softmax Loss for End-to-end Speaker Verification

Improved Large-Margin Softmax Loss for Speaker Diarisation

Ensemble Additive Margin Softmax for Speaker Verification

Margin-Mixup: A Method for Robust Speaker Verification in Multi-Speaker Audio

Tackling the Score Shift in Cross-Lingual Speaker Verification by Exploiting Language Information

Multi-View Speaker Embedding Learning for Enhanced Stability and Discriminability

Exploiting Speaker Embeddings for Improved Microphone Clustering and Speech Separation in ad-hoc Microphone Arrays

Max-margin Metric Learning for Speaker Recognition

Experimenting with Additive Margins for Contrastive Self-Supervised Speaker Verification

Neural Scoring, Not Embedding: A Novel Framework for Robust Speaker Verification

Rethinking Session Variability: Leveraging Session Embeddings for Session Robustness in Speaker Verification

Powerful Speaker Embedding Training Framework by Adversarially Disentangled Identity Representation

Real Additive Margin Softmax for Speaker Verification

Incorporating Uncertainty from Speaker Embedding Estimation to Speaker Verification

Adapting End-to-End Neural Speaker Verification to New Languages and Recording Conditions with Adversarial Training

Speaker disentanglement in video-to-speech conversion

A Joint Noise Disentanglement and Adversarial Training Framework for Robust Speaker Verification