Abstract: Text-to-image models aim to generate realistic images that match a text description, but generating high-quality, accurate images remains a significant challenge. Most existing text-to-image methods use a two-stage stacked model: the first stage creates an initial image with a basic outline, and the second refines it into a high-resolution image. The quality of the initial image limits this approach, because it directly affects the quality of the final high-resolution output and can reduce its randomness. If the initial image is of low quality or lacks detail, the model struggles to produce a high-quality, realistic final image; if it is too rigid or lacks randomness, the final image lacks diversity and appears artificial. To overcome this limitation of the stacked structure, we propose a new generative adversarial network method that generates high-resolution images directly from text descriptions, providing a more efficient and effective way to generate realistic images from text. Multi-head channel attention and masked cross-attention mechanisms weigh relevance from several perspectives, enhancing significant features associated with the text description and suppressing features unrelated to it. Image and text information are integrated at a fine-grained level, while a masking mechanism reduces computational cost and shortens image generation time. Furthermore, a discriminator-based semantic consistency loss function is designed to strengthen the visual coherence between text and images, guiding the generator toward more realistic images that align closely with the text descriptions.
The enhanced model improves the semantic consistency between text and images, leading to higher-quality generated images. Extensive experiments confirm that the proposed model outperforms ControlGAN. On the CUB dataset, the Inception Score (IS) rises from 4.58 to 4.96, and on the COCO dataset it rises from 24.06 to 33.56. Code is available at https://github.com/Leeziying0307/Github.git.
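
As a rough illustration of the masked cross-attention idea mentioned in the abstract, the following minimal sketch lets image-region features attend over word features while a mask skips some regions to save computation. It is not the authors' implementation: the function name `masked_cross_attention`, the per-region boolean `keep` mask, the scaled dot-product scoring, and the residual update are all assumptions made for illustration, since the abstract does not specify how the mask is built or applied.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def masked_cross_attention(regions, words, keep):
    """Hypothetical masked cross-attention (illustrative only).

    regions: list of image-region feature vectors (the queries).
    words:   list of word feature vectors (used as keys and values).
    keep:    list of bools; a region with keep[i] == False is masked,
             so no attention is computed for it (assumed cost saving).
    """
    dim = len(words[0])
    scale = math.sqrt(dim)
    out = []
    for region, attend in zip(regions, keep):
        if not attend:
            # Masked region: pass the feature through unchanged.
            out.append(list(region))
            continue
        # Scaled dot-product score between this region and every word.
        scores = [sum(r * w for r, w in zip(region, word)) / scale
                  for word in words]
        weights = softmax(scores)
        # Text context: attention-weighted sum of the word features.
        context = [sum(wt * word[d] for wt, word in zip(weights, words))
                   for d in range(dim)]
        # Residual update (an assumption): add the context to the region.
        out.append([r + c for r, c in zip(region, context)])
    return out


regions = [[1.0, 0.0], [0.0, 1.0]]
words = [[1.0, 0.0], [0.0, 1.0]]
result = masked_cross_attention(regions, words, keep=[True, False])
```

In this toy call the first region is updated with text context weighted toward the word it most resembles, while the second, masked region is returned unchanged.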

Adaptive Forgetting, Drafting and Comprehensive Guiding: Text-to-Image Synthesis with Hierarchical Generative Adversarial Networks

Learn, Imagine and Create: Text-to-Image Generation from Prior Knowledge.

Specific Diverse Text-to-Image Synthesis Via Exemplar Guidance

DM-GAN: Dynamic Memory Generative Adversarial Networks for Text-to-Image Synthesis

DF-GAN: Deep Fusion Generative Adversarial Networks for Text-to-Image Synthesis.

KT-GAN: Knowledge-Transfer Generative Adversarial Network for Text-to-Image Synthesis

Object-driven Text-to-Image Synthesis via Adversarial Training

Perceptual Pyramid Adversarial Networks for Text-to-Image Synthesis.

SAM-GAN: Self-Attention supporting Multi-stage Generative Adversarial Networks for text-to-image synthesis

Customizable GAN: Customizable Image Synthesis Based on Adversarial Learning.

DGattGAN: Cooperative Up-Sampling Based Dual Generator Attentional GAN on Text-to-Image Synthesis

Masked cross-attention and multi-head channel attention guiding single-stage generative adversarial networks for text-to-image generation

Hybrid Attention Driven Text-To-Image Synthesis Via Generative Adversarial Networks

Exploring Sparse MoE in GANs for Text-conditioned Image Synthesis.

DMF-GAN: Deep Multimodal Fusion Generative Adversarial Networks for Text-to-Image Synthesis

SAW-GAN: Multi-granularity Text Fusion Generative Adversarial Networks for text-to-image generation

DTGAN: Dual Attention Generative Adversarial Networks for Text-to-Image Generation

Investigation related to application of Generative Adversarial Networks in text-to-image synthesis

DF-GAN: A Simple and Effective Baseline for Text-to-Image Synthesis

Cross-modal Feature Alignment Based Hybrid Attentional Generative Adversarial Networks for Text-to-image Synthesis

R-GAN: Exploring Human-like Way for Reasonable Text-to-Image Synthesis via Generative Adversarial Networks