Abstract: Image generation from natural language has become a very promising area of research in multimodal learning. Performance on this task has improved rapidly in recent years, and the release of powerful tools has attracted wide attention. The Stacked Generative Adversarial Networks (StackGAN) model is a representative method for generating images from text descriptions. Although it can generate high-resolution images, it has several limitations: some of the generated images are unintelligible, and mode collapse may occur. In this study, we aim to solve these two problems and generate images that follow a given text description more closely. First, we incorporate a new consistency regularization technique for conditional generation tasks, called Improved Consistency Regularization (ICR), into StackGAN. ICR learns the meaning of data by matching the semantic information of input data before and after data augmentation, and it can also stabilize learning in adversarial networks. In this research, it mainly suppresses mode collapse by expanding the variation of generated images. However, it may lead to excessive variation, which can produce images that are ambiguous or that do not match the meaning of the input text. Therefore, we further propose a new regularization method called Improved Conditional Consistency Regularization (ICCR), a modification of ICR that is designed for conditional generation tasks and eliminates the negative impacts of the generator. This method enables the generation of diverse images that follow the input text. On the CUB dataset, the proposed StackGAN with ICCR achieved an Inception Score 16% higher than StackGAN and 4% higher than StackGAN with ICR and AttnGAN. AttnGAN, like StackGAN, is a GAN-based text-to-image model; it incorporates the attention mechanism, which has achieved great results in recent years. It is notable that our proposed model, which incorporates ICCR into a simple model, obtained better results than AttnGAN. In addition, StackGAN with ICCR was effective in eliminating mode collapse: the probability of mode collapse was 20% in the original StackGAN and 0% in StackGAN with ICCR. In the questionnaire survey, our proposed method was rated 18% higher than StackGAN with ICR. This indicates that ICCR is more effective than ICR for conditional tasks.
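The general idea behind consistency regularization, matching a model's response to an input before and after data augmentation, can be illustrated with a minimal sketch. This is not the ICR or ICCR formulation from the paper, which the abstract does not specify. The function names, the toy discriminators, the horizontal-flip augmentation, and the weight of 10.0 are all assumptions chosen for illustration.

```python
def horizontal_flip(image):
    """Augmentation (assumed for illustration): mirror each row of a 2D image."""
    return [row[::-1] for row in image]


def consistency_penalty(discriminator, images, augment, weight=10.0):
    """Mean squared gap between discriminator scores before and after augmentation.

    The weight is a hypothetical hyperparameter, not a value from the paper.
    """
    gaps = [(discriminator(img) - discriminator(augment(img))) ** 2 for img in images]
    return weight * sum(gaps) / len(gaps)


def flip_invariant_disc(image):
    """Toy discriminator: mean pixel value, which a flip does not change."""
    pixels = [p for row in image for p in row]
    return sum(pixels) / len(pixels)


def flip_sensitive_disc(image):
    """Toy discriminator: weights pixels by column index, so a flip changes the score."""
    total = sum(p * col for row in image for col, p in enumerate(row))
    count = sum(len(row) for row in image)
    return total / count


# Two tiny 2x2 "images" standing in for a training batch.
batch = [
    [[0.0, 1.0], [0.0, 1.0]],
    [[0.5, 0.5], [0.25, 0.75]],
]

invariant_penalty = consistency_penalty(flip_invariant_disc, batch, horizontal_flip)
sensitive_penalty = consistency_penalty(flip_sensitive_disc, batch, horizontal_flip)
print(invariant_penalty, sensitive_penalty)
```

In this toy setting, a discriminator whose score is unaffected by the augmentation incurs no penalty, while one that reacts to the flip is penalized. In a real GAN, a term of this kind would be added to the discriminator loss during training. How ICCR adapts it to text-conditioned generation is described in the paper itself, not here.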

Stacking VAE and GAN for Context-aware Text-to-Image Generation

Learn, Imagine and Create: Text-to-Image Generation from Prior Knowledge

Text to Image Synthesis using Stacked Conditional Variational Autoencoders and Conditional Generative Adversarial Networks

Diversified text-to-image generation via deep mutual information estimation

StackGAN: Text to Photo-Realistic Image Synthesis with Stacked Generative Adversarial Networks

Specific Diverse Text-to-Image Synthesis Via Exemplar Guidance

StackGAN++: Realistic Image Synthesis with Stacked Generative Adversarial Networks

Text to Image Synthesis using Generative Adversarial Networks

Image Generation from Text Using StackGAN with Improved Conditional Consistency Regularization

Variational Hetero-Encoder Randomized GANs for Joint Image-Text Modeling

Masked cross-attention and multi-head channel attention guiding single-stage generative adversarial networks for text-to-image generation

Densely Stacked Generative Adversarial Networks

Exploring Sparse MoE in GANs for Text-conditioned Image Synthesis

R-GAN: Exploring Human-like Way for Reasonable Text-to-Image Synthesis via Generative Adversarial Networks

A Comparative Study of Generative Adversarial Networks for Text-to-Image Synthesis

DM-GAN: Dynamic Memory Generative Adversarial Networks for Text-to-Image Synthesis

Stack-VS: Stacked Visual-Semantic Attention for Image Caption Generation

OptGAN: Optimizing and Interpreting the Latent Space of the Conditional Text-to-Image GANs

Vision-Language Matching for Text-to-Image Synthesis via Generative Adversarial Networks