Abstract: The primary characteristic of robust speaker representations is that they are invariant to factors of variability not related to speaker identity. Disentanglement is one of the techniques used to improve the robustness of speaker representations to both intrinsic factors that are acquired during speech production (e.g., emotion, lexical content) and extrinsic factors that are acquired during signal capture (e.g., channel, noise). Disentanglement in neural speaker representations can be achieved either in a supervised fashion, with annotations of the nuisance factors (factors not related to speaker identity), or in an unsupervised fashion, without labels of the factors to be removed. In either case, it is important to understand the extent to which the various factors of variability are entangled in the representations. In this work, we examine speaker representations with and without unsupervised disentanglement for the amount of information they capture related to a suite of factors. Using classification experiments, we provide empirical evidence that disentanglement reduces the nuisance-factor information in speaker representations while retaining speaker information. This is further validated by speaker verification experiments on the VOiCES corpus in several challenging acoustic conditions. We also show improved robustness in speaker verification tasks when data augmentation is used during training of disentangled speaker embeddings. Finally, based on our findings, we provide insights into the factors that can be effectively separated using the unsupervised disentanglement technique and discuss potential future directions.

An empirical analysis of information encoded in disentangled neural speaker representations
