Abstract:We introduce a conditional generative model for learning to disentangle the hidden factors of variation within a set of labeled observations, and separate them into complementary codes. One code summarizes the specified factors of variation associated with the labels. The other summarizes the remaining unspecified variability. During training, the only available source of supervision comes from our ability to distinguish among different observations belonging to the same class. Examples of such observations include images of a set of labeled objects captured at different viewpoints, or recordings of set of speakers dictating multiple phrases. In both instances, the intra-class diversity is the source of the unspecified factors of variation: each object is observed at multiple viewpoints, and each speaker dictates multiple phrases. Learning to disentangle the specified factors from the unspecified ones becomes easier when strong supervision is possible. Suppose that during training, we have access to pairs of images, where each pair shows two different objects captured from the same viewpoint. This source of alignment allows us to solve our task using existing methods. However, labels for the unspecified factors are usually unavailable in realistic scenarios where data acquisition is not strictly controlled. We address the problem of disentaglement in this more general setting by combining deep convolutional autoencoders with a form of adversarial training. Both factors of variation are implicitly captured in the organization of the learned embedding space, and can be used for solving single-image analogies. Experimental results on synthetic and real datasets show that the proposed method is capable of generalizing to unseen classes and intra-class variabilities.

An empirical analysis of information encoded in disentangled neural speaker representations

Disentangling Textual and Acoustic Features of Neural Speech Representations

Investigating Speaker Embedding Disentanglement on Natural Read Speech

Disentangled Representation Learning for Environment-agnostic Speaker Recognition

Noise-Disentanglement Metric Learning for Robust Speaker Verification

Self-Supervised Disentangled Representation Learning for Robust Target Speech Extraction

Powerful Speaker Embedding Training Framework by Adversarially Disentangled Identity Representation

Disentangling Factors of Variation in Deep Representations Using Adversarial Training.

Residual Information in Deep Speaker Embedding Architectures

Towards the Next Frontier in Speech Representation Learning Using Disentanglement

Disentangling speech from surroundings with neural embeddings

Speaker disentanglement in video-to-speech conversion

DEAAN: Disentangled Embedding and Adversarial Adaptation Network for Robust Speaker Representation Learning

Bridging Disentanglement with Independence and Conditional Independence via Mutual Information for Representation Learning

On Feature Importance and Interpretability of Speaker Representations

Bridging Disentanglement with Independence and Conditional Independence Via Mutual Information for Representation Learning.

Improving Robustness and Generality of NLP Models Using Disentangled Representations

A Step Towards Preserving Speakers' Identity While Detecting Depression Via Speaker Disentanglement

A Joint Noise Disentanglement and Adversarial Training Framework for Robust Speaker Verification

Disentanglement Network: Disentangle the Emotional Features from Acoustic Features for Speech Emotion Recognition

Semantic Disentangling for Audiovisual Induced Emotion