Stacking VAE and GAN for Context-aware Text-to-Image Generation

Chenrui Zhang, Yuxin Peng
DOI: https://doi.org/10.1109/bigmm.2018.8499439
2018-01-01
Abstract: Generating high-quality images from text descriptions is an appealing research topic with widespread applications in various fields. However, it is quite challenging because real-world images and language descriptions are noisy and highly variable. Most existing text-to-image methods generate images in a holistic manner, ignoring the difference between an image's foreground and background, so that objects in the generated images are easily disturbed by the background. Moreover, they commonly ignore the complementarity of different kinds of generative models. In this paper, we propose a context-aware approach to text-to-image generation that separates background and foreground to generate high-quality images and exploits the complementarity between the Variational Autoencoder (VAE) and the Generative Adversarial Network (GAN) for robust text-to-image generation. First, a context-aware conditional VAE is proposed to capture the basic layout and color of an image based on the text; it pays different attention to the background and foreground of images for effective text-image alignment. Then, a conditional GAN is adopted to refine the output of the VAE, recovering lost details and correcting defects for realistic image generation. Owing to this stacked VAE-GAN structure, the two kinds of generative models can boost each other for more effective and stable text-to-image generation. Experimental results on two widely used datasets empirically verify the effectiveness of the proposed approach.
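The two-stage data flow described in the abstract can be illustrated with a toy sketch. This is not the authors' model: the function names (`reparameterize`, `coarse_stage`, `refine_stage`, `generate`), the attention-weighted blend of foreground and background, and the residual refinement step are all hypothetical stand-ins chosen for illustration. Only the Gaussian reparameterization is the standard VAE sampling step. Images are flat lists of pixel values and no learned networks are involved.

```python
import math
import random


def reparameterize(mu, log_var, rng):
    """Standard VAE sampling: z = mu + sigma * eps, with eps ~ N(0, 1)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def coarse_stage(text_emb, z, attention):
    """Stage 1 stand-in (assumption): blend toy foreground/background layers.

    attention[i] in [0, 1] is the foreground weight of pixel i.
    """
    fg = [math.tanh(t + zi) for t, zi in zip(text_emb, z)]  # toy text-driven foreground
    bg = [math.tanh(zi) for zi in z]                        # toy text-free background
    return [a * f + (1.0 - a) * b for a, f, b in zip(attention, fg, bg)]


def refine_stage(coarse, text_emb, strength=0.1):
    """Stage 2 stand-in (assumption): text-conditioned residual, clipped to [-1, 1]."""
    return [max(-1.0, min(1.0, c + strength * math.tanh(t)))
            for c, t in zip(coarse, text_emb)]


def generate(text_emb, mu, log_var, attention, seed=0):
    """Run the stacked pipeline: sample z, build a coarse image, then refine it."""
    rng = random.Random(seed)
    z = reparameterize(mu, log_var, rng)
    coarse = coarse_stage(text_emb, z, attention)
    return coarse, refine_stage(coarse, text_emb)


# Hypothetical 4-pixel example: values below are made up for illustration.
text_emb = [0.5, -0.2, 0.9, 0.0]
mu = [0.0] * 4
log_var = [0.0] * 4
attention = [1.0, 0.0, 0.5, 0.5]
coarse, refined = generate(text_emb, mu, log_var, attention)
```

In this sketch the second stage only adjusts the first-stage output rather than generating from scratch, which mirrors the abstract's description of the GAN as a refiner of the VAE result. In a real system each stage would be a trained neural network.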