Abstract: Fixed-dimensional speaker embeddings have become the dominant approach in speaker modeling, typically spanning hundreds to thousands of dimensions. These dimensions are hyperparameters that are not specifically picked, nor are they hierarchically ordered in terms of importance. In large-scale speaker representation databases, reducing the dimensionality of embeddings can significantly lower storage and computational costs. However, directly training low-dimensional representations often yields suboptimal performance. In this paper, we introduce the Matryoshka speaker embedding, a method that allows dynamic extraction of sub-dimensions from the embedding while maintaining performance. Our approach is validated on the VoxCeleb dataset, demonstrating that it can achieve extremely low-dimensional embeddings, such as 8 dimensions, while preserving high speaker verification performance.

What problem does this paper attempt to address?

The problem this paper attempts to solve is: **How can speaker embeddings of different dimensions be extracted flexibly while maintaining high performance, so as to adapt to different application scenarios and resource limits?** Traditional speaker embedding methods represent a speaker with a fixed-dimensional vector. The dimension is usually chosen empirically, and the individual dimensions have no hierarchical structure or importance ranking. As a result, high-dimensional embeddings incur high storage and computational costs when searching large-scale databases, while directly training low-dimensional embeddings often degrades performance. To address this, the paper proposes the **Matryoshka Speaker Embedding (M-Vec)**, which allows sub-dimensions to be extracted dynamically and maintains high speaker verification performance even at extremely low dimensions.

### Main problem summary:

1. **Limitations of fixed dimensions**: Traditional methods use fixed-dimensional embeddings and cannot adapt flexibly to the requirements of different application scenarios.
2. **Cost of high-dimensional embeddings**: High-dimensional embeddings significantly increase storage and computational costs in large-scale databases.
3. **Performance of low-dimensional embeddings**: Directly training low-dimensional embeddings often reduces performance, especially at extremely low dimensions.

### Solutions:

The paper trains the embedding with **Matryoshka Representation Learning (MRL)**, which optimizes embeddings of multiple dimensions simultaneously during training. At inference, embeddings of different dimensions can then be extracted from the same model while maintaining performance. In this way, high speaker verification performance is retained even at extremely low dimensions (such as 8 or 16), which reduces storage and computational costs.

### Key contributions:

1. **First exploration of variable-dimensional speaker representations**: Different tasks can select an embedding dimension that suits their requirements.
2. **MRL training for speaker embeddings**: Speaker discrimination training is performed on multiple dimensions simultaneously.
3. **More expressive low-dimensional embeddings**: Even at extremely low dimensions, the embeddings retain strong discriminative ability.

Through these contributions, the paper provides a more flexible and efficient solution for speaker recognition, particularly in resource-constrained scenarios.
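The multi-dimension training and inference-time extraction summarized above can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes nested embeddings are taken as prefixes of the full vector, uses a plain softmax cross-entropy with one hypothetical linear classifier per nested size, and sums the per-size losses with equal weight. The nested sizes `(8, 16, 32)` and all function names are illustrative assumptions.

```python
import math

# Hypothetical nested sizes; the largest is the full embedding dimension.
NESTED_DIMS = (8, 16, 32)


def softmax_cross_entropy(logits, label):
    """Cross-entropy for one example, using the log-sum-exp trick for stability."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[label]


def matryoshka_loss(embedding, label, heads, dims=NESTED_DIMS):
    """Sum the speaker-classification loss over every nested prefix.

    heads[d] is an assumed per-size classifier: one weight row of length d
    per training speaker. Equal loss weights are an assumption of this sketch.
    """
    total = 0.0
    for d in dims:
        prefix = embedding[:d]  # the first d dimensions form the d-dim embedding
        logits = [sum(w * x for w, x in zip(row, prefix)) for row in heads[d]]
        total += softmax_cross_entropy(logits, label)
    return total


def cosine(a, b):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def verify_score(emb_a, emb_b, d):
    """Score a verification trial using only the first d dimensions."""
    return cosine(emb_a[:d], emb_b[:d])
```

At inference, a deployment with tight storage limits could keep only `embedding[:8]` per enrolled speaker and call `verify_score(a, b, 8)`, while a less constrained one could score the same embeddings with a larger `d`; no retraining is involved in this sketch because every prefix was optimized during training.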

M-Vec: Matryoshka Speaker Embeddings with Flexible Dimensions
