Abstract: Text-to-image generation aims to generate images from text descriptions. Its main challenge lies in two aspects: (1) semantic consistency, i.e., the generated images should be semantically consistent with the input text; and (2) visual realism, i.e., the generated images should look like real images. To ensure text-image consistency, existing works mainly learn cross-modal representations via a text encoder and an image encoder. However, because fixed-length embeddings have limited representation capability and free-form text descriptions are highly flexible, the learned text-to-image model cannot maintain semantic consistency between local image regions and fine-grained descriptions. As a result, the generated images sometimes miss fine-grained attributes of the generated object, such as the color or shape of one of its parts. To address this issue, this paper proposes a Local Feature Refinement Based Generative Adversarial Network (LFR-GAN), which first divides the text into independent fine-grained attributes and generates an initial image, then refines the image details based on these attributes. The main contributions are three-fold: (1) An attribute modeling approach is proposed to model fine-grained text descriptions by mapping them into representations of independent attributes, which provides more fine-grained details for image generation. (2) A local feature refinement approach is proposed so that the generated image fully reflects the fine-grained attributes contained in the text description. (3) A multi-stage generation approach is proposed to progressively realize fine-grained manipulation of complex images, which aims to improve the performance of the refinement and generate photo-realistic images. Extensive experiments on the CUB and Oxford102 datasets show the effectiveness of LFR-GAN in both text-to-image generation and text-guided image manipulation, where it shows superior performance to state-of-the-art methods. The code will be released at https://github.com/PKU-ICST-MIPL/LFR-GAN_TOMM2023.
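
The control flow described in the abstract (split the text into independent attributes, generate an initial image, then refine it over several stages) can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not specify how attributes are extracted or how refinement works, so the rule-based splitter, the `Canvas` stand-in for an image, and every function name below are hypothetical and chosen only to make the pipeline shape concrete.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Canvas:
    """Toy stand-in for an image: tracks resolution and reflected attributes."""
    resolution: int
    applied: list = field(default_factory=list)


def split_attributes(caption: str) -> list:
    # Assumption: attributes are separated by commas or the words "and"/"with".
    # A learned attribute model would replace this naive rule.
    parts = re.split(r",|\band\b|\bwith\b", caption.lower())
    return [p.strip(" .") for p in parts if p.strip(" .")]


def generate_initial(caption: str, resolution: int = 64) -> Canvas:
    # Placeholder for the initial generator: no attributes reflected yet.
    return Canvas(resolution=resolution)


def refine(canvas: Canvas, attributes: list) -> Canvas:
    # Placeholder refinement stage: double the resolution and add any
    # attribute that the current canvas does not yet reflect.
    missing = [a for a in attributes if a not in canvas.applied]
    return Canvas(resolution=canvas.resolution * 2,
                  applied=canvas.applied + missing)


def text_to_image(caption: str, stages: int = 2):
    # Split first, generate a coarse result, then refine stage by stage.
    attributes = split_attributes(caption)
    canvas = generate_initial(caption)
    for _ in range(stages):
        canvas = refine(canvas, attributes)
    return attributes, canvas


if __name__ == "__main__":
    attrs, result = text_to_image(
        "a small bird with a red head, white belly and black wings")
    print(attrs)
    print(result.resolution, result.applied)
```

In this sketch the first refinement stage already picks up every attribute and later stages only raise the resolution; how the actual model distributes attribute refinement across stages is not stated in the abstract.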

Exploring Global and Local Linguistic Representations for Text-to-Image Synthesis

Specific Diverse Text-to-Image Synthesis Via Exemplar Guidance

Statistics Enhancement Generative Adversarial Networks for Diverse Conditional Image Synthesis

Exploring Sparse MoE in GANs for Text-conditioned Image Synthesis

LFR-GAN: Local Feature Refinement based Generative Adversarial Network for Text-to-Image Generation

MF-GAN: Multi-conditional Fusion Generative Adversarial Network for Text-to-Image Synthesis

CF-GAN: cross-domain feature fusion generative adversarial network for text-to-image synthesis

SAW-GAN: Multi-granularity Text Fusion Generative Adversarial Networks for text-to-image generation

DMF-GAN: Deep Multimodal Fusion Generative Adversarial Networks for Text-to-Image Synthesis

Cross-modal Feature Alignment Based Hybrid Attentional Generative Adversarial Networks for Text-to-image Synthesis

CT-GAN: A conditional Generative Adversarial Network of transformer architecture for text-to-image

Vision-Language Matching for Text-to-Image Synthesis via Generative Adversarial Networks

R-GAN: Exploring Human-like Way for Reasonable Text-to-Image Synthesis via Generative Adversarial Networks

CgT-GAN: CLIP-guided Text GAN for Image Captioning

CPGAN: Content-Parsing Generative Adversarial Networks for Text-to-Image Synthesis

Multi-Sentence Auxiliary Adversarial Networks for Fine-Grained Text-to-Image Synthesis

MGF-GAN: Multi Granularity Text Feature Fusion for Text-guided-Image Synthesis

GACnet-Text-to-Image Synthesis With Generative Models Using Attention Mechanisms With Contrastive Learning

OptGAN: Optimizing and Interpreting the Latent Space of the Conditional Text-to-Image GANs

GL-GAN: Adaptive Global and Local Bilevel Optimization model of Image Generation

Word self-update contrastive adversarial networks for text-to-image synthesis