Variational methods for Conditional Multimodal Deep Learning

Gaurav Pandey, Ambedkar Dukkipati

DOI: https://doi.org/10.48550/arXiv.1603.01801

2016-08-26

Abstract: In this paper, we address the problem of conditional modality learning, whereby one is interested in generating one modality given the other. While it is straightforward to learn a joint distribution over multiple modalities using a deep multimodal architecture, we observe that such models aren't very effective at conditional generation. Hence, we address the problem by learning conditional distributions between the modalities. We use variational methods for maximizing the corresponding conditional log-likelihood. The resultant deep model, which we refer to as conditional multimodal autoencoder (CMMA), forces the latent representation obtained from a single modality alone to be 'close' to the joint representation obtained from multiple modalities. We use the proposed model to generate faces from attributes. We show that the faces generated from attributes using the proposed model are qualitatively and quantitatively more representative of the attributes from which they were generated than those obtained by other deep generative models. We also propose a secondary task, whereby the existing faces are modified by modifying the corresponding attributes. We observe that the modifications in face introduced by the proposed model are representative of the corresponding modifications in attributes.
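
For readers unfamiliar with the variational approach mentioned in the abstract, the following is the standard variational lower bound on a conditional log-likelihood, written as a sketch rather than as the paper's exact objective. The symbols are assumptions introduced here: x is the modality to be generated (for example, a face), y is the conditioning modality (for example, attributes), z is a latent variable, q is an approximate posterior, and p is the generative model. The paper's own factorization may differ in detail.

```latex
\log p(x \mid y)
  \;\ge\;
  \mathbb{E}_{q(z \mid x, y)}\bigl[\log p(x \mid z, y)\bigr]
  \;-\;
  \mathrm{KL}\bigl(q(z \mid x, y) \,\big\|\, p(z \mid y)\bigr)
```

Under this reading, the KL term is what pulls the representation computed from y alone, p(z | y), toward the joint representation computed from both modalities, q(z | x, y).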

Computer Vision and Pattern Recognition, Machine Learning

What problem does this paper attempt to address?

This paper addresses conditional modality learning: generating one modality given another. A deep multimodal architecture can readily learn a joint distribution over several modalities, but the authors observe that such joint models are not very effective at conditional generation. They therefore learn the conditional distribution between modalities directly, using variational methods to maximize the corresponding conditional log-likelihood. The resulting model, the conditional multimodal autoencoder (CMMA), forces the latent representation obtained from a single modality alone to be "close" to the joint representation obtained from multiple modalities. The authors use CMMA to generate faces from attributes and show that the generated faces are, both qualitatively and quantitatively, more representative of the attributes they were generated from than faces produced by other deep generative models. They also propose a secondary task in which existing faces are modified by modifying the corresponding attributes, and observe that the resulting changes in the face reflect the corresponding changes in the attributes.
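
One common way to make a single-modality latent representation "close" to a joint one is to penalize the KL divergence between two diagonal Gaussians and subtract that penalty from a reconstruction log-likelihood. The sketch below illustrates that idea only; it is not taken from the paper. The function names `gaussian_kl` and `conditional_elbo`, the diagonal-Gaussian assumption, and the log-variance parameterization are all hypothetical choices made for illustration.

```python
import math


def gaussian_kl(mu_q, logvar_q, mu_p, logvar_p):
    """KL( N(mu_q, diag(exp(logvar_q))) || N(mu_p, diag(exp(logvar_p))) ).

    Assumption: both distributions are diagonal Gaussians given as lists of
    per-dimension means and log-variances of equal length.
    """
    total = 0.0
    for mq, lq, mp, lp in zip(mu_q, logvar_q, mu_p, logvar_p):
        # Closed-form KL between two univariate Gaussians, summed over dimensions.
        total += 0.5 * (lp - lq + (math.exp(lq) + (mq - mp) ** 2) / math.exp(lp) - 1.0)
    return total


def conditional_elbo(recon_log_lik, mu_q, logvar_q, mu_p, logvar_p):
    """Reconstruction log-likelihood minus the KL penalty.

    Assumption: (mu_q, logvar_q) describe the joint encoder that sees both
    modalities, and (mu_p, logvar_p) describe the encoder that sees only the
    conditioning modality.
    """
    return recon_log_lik - gaussian_kl(mu_q, logvar_q, mu_p, logvar_p)
```

The penalty is zero when the two representations coincide and grows as the single-modality representation drifts away from the joint one, so maximizing this quantity encourages the two to agree.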

Joint Multimodal Learning with Deep Generative Models

Data-Dependent Conditional Priors for Unsupervised Learning of Multimodal Data

Deep Vision Multimodal Learning: Methodology, Benchmark, and Trend

Modality-invariant Temporal Representation Learning for Multimodal Sentiment Classification

Discriminative multimodal learning via conditional priors in generative models

Multimodal Adversarially Learned Inference with Factorized Discriminators

Improving Multimodal Joint Variational Autoencoders through Normalizing Flows and Correlation Analysis

Generalizing Multimodal Variational Methods to Sets

Multi-Modal Latent Diffusion

Multilinear Latent Conditioning for Generating Unseen Attribute Combinations

Improving Bi-directional Generation between Different Modalities with Variational Autoencoders

MHVAE: a Human-Inspired Deep Hierarchical Generative Model for Multimodal Representation Learning

Multimodal Generative Models for Scalable Weakly-Supervised Learning

Learning more expressive joint distributions in multimodal variational methods

Multimodal Generative Models for Compositional Representation Learning

Multimodal Weibull Variational Autoencoder for Jointly Modeling Image-Text Data

Learning Structured Output Representations from Attributes using Deep Conditional Generative Models

Multi-modal data generation with a deep metric variational autoencoder

A survey of multimodal deep generative models

Partial Modal Conditioned GANs for Multi-modal Multi-label Learning with Arbitrary Modal-Missing