Abstract: Text-to-image models aim to generate realistic images that match a text description, yet generating high-quality, accurate images remains a significant challenge. Most existing text-to-image methods use a two-stage stacked architecture: the model first creates an initial image with a basic outline, then refines it into a high-resolution image. The quality of the initial image limits this approach, because it directly determines the quality of the high-resolution output and can constrain its randomness. If the initial image is of low quality or lacks detail, the model struggles to produce a high-quality, realistic final image; if the initial image is too rigid or lacks randomness, the final image lacks diversity and appears artificial. To overcome this limitation of the stacked structure, this paper proposes a new generative adversarial network method that generates high-resolution images directly from text descriptions, providing a more efficient and effective way to generate realistic images from text. Multi-head channel attention and masked cross-attention mechanisms weigh relevance from multiple perspectives, enhancing significant features associated with the text description and suppressing non-essential features unrelated to the textual information. Image and text information are integrated at a fine-grained level, and the masking mechanism reduces computational cost and shortens image generation time. Furthermore, a discriminator-based semantic consistency loss function is devised to strengthen the visual coherence between text and images, guiding the generator toward more realistic images that align closely with the text descriptions.
The enhanced model improves the semantic consistency between text and images, leading to higher-quality generated images. Extensive experiments confirm that the proposed model outperforms ControlGAN. On the CUB dataset, the Inception Score (IS) increases from 4.58 to 4.96, and on the COCO dataset it increases from 24.06 to 33.56. Code is available at https://github.com/Leeziying0307/Github.git.
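
For readers unfamiliar with the IS metric reported above, the following is a minimal sketch of how an Inception Score is computed from per-image class probabilities, using the standard definition IS = exp(E_x[KL(p(y|x) || p(y))]). It is not taken from the paper's code. In practice p(y|x) comes from a pretrained Inception classifier applied to generated images; the function name `inception_score` and the toy distributions in the usage note are hypothetical and serve only as illustration.

```python
import math


def inception_score(class_probs):
    """Compute exp(mean KL(p(y|x) || p(y))) over a set of images.

    class_probs: list of per-image class distributions p(y|x),
    each a list of probabilities that sums to 1.
    (Hypothetical helper for illustration; not the paper's evaluation code.)
    """
    n = len(class_probs)
    k = len(class_probs[0])

    # Marginal class distribution p(y): average of p(y|x) over all images.
    marginal = [sum(p[j] for p in class_probs) / n for j in range(k)]

    # Sum of KL(p(y|x) || p(y)) over images; zero-probability terms contribute 0.
    kl_sum = 0.0
    for p in class_probs:
        kl_sum += sum(
            pj * math.log(pj / marginal[j]) for j, pj in enumerate(p) if pj > 0
        )

    return math.exp(kl_sum / n)
```

Under this definition the score ranges from 1 (every image yields the same class distribution) up to the number of classes (each image is classified with full confidence and the classes are evenly covered). For example, `inception_score([[1, 0, 0], [0, 1, 0], [0, 0, 1]])` evaluates to approximately 3, while five identical uniform distributions give 1. A higher IS therefore indicates images that are both individually recognizable and collectively diverse.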

CT-GAN: A conditional Generative Adversarial Network of transformer architecture for text-to-image

Synthesizing Contrast-enhanced Computed Tomography Images with an Improved Conditional Generative Adversarial Network

TcGAN: Semantic-Aware and Structure-Preserved GANs with Individual Vision Transformer for Fast Arbitrary One-Shot Image Generation

Statistics Enhancement Generative Adversarial Networks for Diverse Conditional Image Synthesis

KT-GAN: Knowledge-Transfer Generative Adversarial Network for Text-to-Image Synthesis

CgT-GAN: CLIP-guided Text GAN for Image Captioning

GACnet-Text-to-Image Synthesis With Generative Models Using Attention Mechanisms With Contrastive Learning

DTGAN: Dual Attention Generative Adversarial Networks for Text-to-Image Generation

Tf-Gan: Text Feature Fusion Gan for Text-to-Image Generation

GANai: Standardizing CT Images using Generative Adversarial Network with Alternative Improvement

The Nuts and Bolts of Adopting Transformer in GANs

CSGAN: Cyclic-Synthesized Generative Adversarial Networks for image-to-image transformation

Masked cross-attention and multi-head channel attention guiding single-stage generative adversarial networks for text-to-image generation

TransGAN: Two Pure Transformers Can Make One Strong GAN, and That Can Scale Up

LFR-GAN: Local Feature Refinement based Generative Adversarial Network for Text-to-Image Generation

CTGAN: Semantic-guided Conditional Texture Generator for 3D Shapes

OptGAN: Optimizing and Interpreting the Latent Space of the Conditional Text-to-Image GANs

RATLIP: Generative Adversarial CLIP Text-to-Image Synthesis Based on Recurrent Affine Transformations

TCGAN: a transformer-enhanced GAN for PET synthetic CT

CT synthesis from MR in the pelvic area using Residual Transformer Conditional GAN

Transformation GAN for Unsupervised Image Synthesis and Representation Learning