Variational methods for Conditional Multimodal Deep Learning

Gaurav Pandey, Ambedkar Dukkipati
DOI: https://doi.org/10.48550/arXiv.1603.01801
2016-08-26
Abstract: In this paper, we address the problem of conditional modality learning, whereby one is interested in generating one modality given the other. While it is straightforward to learn a joint distribution over multiple modalities using a deep multimodal architecture, we observe that such models aren't very effective at conditional generation. Hence, we address the problem by learning conditional distributions between the modalities. We use variational methods for maximizing the corresponding conditional log-likelihood. The resultant deep model, which we refer to as conditional multimodal autoencoder (CMMA), forces the latent representation obtained from a single modality alone to be "close" to the joint representation obtained from multiple modalities. We use the proposed model to generate faces from attributes. We show that the faces generated from attributes using the proposed model are qualitatively and quantitatively more representative of the attributes from which they were generated than those obtained by other deep generative models. We also propose a secondary task, whereby the existing faces are modified by modifying the corresponding attributes. We observe that the modifications in face introduced by the proposed model are representative of the corresponding modifications in attributes.
Computer Vision and Pattern Recognition, Machine Learning
What problem does this paper attempt to address?
This paper addresses conditional modality learning: generating one modality given another. Although a deep multimodal architecture can readily learn the joint distribution over multiple modalities, the authors observe that such models are not very effective at conditional generation. They therefore learn the conditional distribution between modalities directly, using variational methods to maximize the corresponding conditional log-likelihood. The resulting model, the conditional multimodal autoencoder (CMMA), forces the latent representation obtained from a single modality alone to be "close" to the joint representation obtained from multiple modalities. The paper applies the model to generating faces from attributes and reports that the generated faces are qualitatively and quantitatively more representative of their source attributes than those produced by other deep generative models. It also proposes a secondary task, in which existing faces are modified by changing the corresponding attributes, and observes that the resulting changes to the face reflect the changes made to the attributes.
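
To make "maximizing the conditional log-likelihood using variational methods" concrete, the generic variational lower bound for a conditional latent-variable model can be sketched as follows. This is the standard textbook form, not necessarily the exact objective or parameterization used in the paper. The symbols are assumptions introduced here for illustration: x is the generated modality (a face), y is the conditioning modality (the attributes), z is the latent variable, and q is an approximate posterior.

```latex
% Assumed notation (not taken from the paper):
%   x = generated modality, y = conditioning modality, z = latent variable,
%   q(z | x, y) = approximate posterior, p(z | y) = conditional prior.
\begin{align}
\log p(x \mid y)
  &= \log \int p(x \mid z, y)\, p(z \mid y)\, dz \\
  &\ge \mathbb{E}_{q(z \mid x, y)}\big[\log p(x \mid z, y)\big]
     - \mathrm{KL}\big(q(z \mid x, y)\,\|\,p(z \mid y)\big)
\end{align}
```

Under this reading, the KL term is what penalizes the distance between the representation inferred from both modalities, q(z | x, y), and the one available from the conditioning modality alone, p(z | y), which matches the "closeness" described above.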