Abstract: Text-to-image generation aims to automatically produce a photo-realistic image conditioned on a textual description. It has potential applications in art creation, data augmentation, photo editing, and other fields. Although many efforts have been dedicated to this task, generating believable, natural scenes remains particularly challenging. To facilitate real-world applications of text-to-image synthesis, we study the following three issues: 1) How can we ensure that generated samples are believable, realistic, or natural? 2) How can the latent space of the generator be exploited to edit a synthesized image? 3) How can the explainability of a text-to-image generation framework be improved? In this work, we construct two novel data sets (i.e., the Good & Bad bird and face data sets) consisting of successful as well as unsuccessful generated samples, selected according to strict criteria. To acquire high-quality images effectively and efficiently by increasing the probability of generating Good latent codes, we use a dedicated Good/Bad classifier for generated images. It is based on a pre-trained front end and is fine-tuned on the proposed Good & Bad data set. We then present a novel algorithm that identifies semantically understandable directions in the latent space of a conditional text-to-image GAN architecture by performing independent component analysis on the pre-trained weight values of the generator. Furthermore, we develop a background-flattening loss (BFL) to improve the background appearance in the edited image. Subsequently, we introduce linear interpolation analysis between pairs of keywords. This is extended into a similar triangular 'linguistic' interpolation in order to take a deep look into what a text-to-image synthesis model has learned within the linguistic embeddings.
Our data set is available at https://zenodo.org/record/6283798#.YhkN_ujMI2w.
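
The linear and triangular interpolation between keyword embeddings mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (lerp, lerp_path, triangular) and the toy 3-dimensional "keyword embeddings" are hypothetical, and a real model would use the embeddings produced by its own text encoder. The triangular case is assumed here to be a barycentric (convex) blend of three embeddings.

```python
from typing import List, Sequence

Vector = List[float]


def lerp(a: Sequence[float], b: Sequence[float], t: float) -> Vector:
    """Linear interpolation between two embeddings: (1 - t) * a + t * b."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    return [(1.0 - t) * x + t * y for x, y in zip(a, b)]


def lerp_path(a: Sequence[float], b: Sequence[float], steps: int) -> List[Vector]:
    """Evenly spaced points from a to b, endpoints included."""
    if steps < 2:
        raise ValueError("steps must be at least 2")
    return [lerp(a, b, i / (steps - 1)) for i in range(steps)]


def triangular(
    a: Sequence[float],
    b: Sequence[float],
    c: Sequence[float],
    wa: float,
    wb: float,
    wc: float,
) -> Vector:
    """Barycentric blend of three embeddings (assumed form of the
    triangular interpolation): weights are non-negative and sum to 1."""
    if not (len(a) == len(b) == len(c)):
        raise ValueError("embeddings must have the same dimension")
    if min(wa, wb, wc) < 0.0 or abs(wa + wb + wc - 1.0) > 1e-9:
        raise ValueError("weights must be non-negative and sum to 1")
    return [wa * x + wb * y + wc * z for x, y, z in zip(a, b, c)]


# Hypothetical toy embeddings standing in for three keywords.
red = [1.0, 0.0, 0.0]
blue = [0.0, 0.0, 1.0]
yellow = [1.0, 1.0, 0.0]

path = lerp_path(red, blue, steps=5)                      # five points from red to blue
centre = triangular(red, blue, yellow, 1 / 3, 1 / 3, 1 / 3)  # centroid of the triangle
```

Each interpolated vector would then be fed to the generator in place of the original keyword embedding, so that the resulting images can be compared along the path or across the triangle.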

Exploring Sparse MoE in GANs for Text-conditioned Image Synthesis

Learn, Imagine and Create: Text-to-Image Generation from Prior Knowledge

Specific Diverse Text-to-Image Synthesis Via Exemplar Guidance

Statistics Enhancement Generative Adversarial Networks for Diverse Conditional Image Synthesis

Multimodal Conditional Image Synthesis with Product-of-Experts GANs

DMF-GAN: Deep Multimodal Fusion Generative Adversarial Networks for Text-to-Image Synthesis

R-GAN: Exploring Human-like Way for Reasonable Text-to-Image Synthesis via Generative Adversarial Networks

Object-driven Text-to-Image Synthesis via Adversarial Training

You Only Sample Once: Taming One-Step Text-to-Image Synthesis by Self-Cooperative Diffusion GANs

SparseGAN: Sparse Generative Adversarial Network for Text Generation

DF-GAN: Deep Fusion Generative Adversarial Networks for Text-to-Image Synthesis

OptGAN: Optimizing and Interpreting the Latent Space of the Conditional Text-to-Image GANs

Multi-Sentence Auxiliary Adversarial Networks for Fine-Grained Text-to-Image Synthesis

DM-GAN: Dynamic Memory Generative Adversarial Networks for Text-to-Image Synthesis

SAM-GAN: Self-Attention supporting Multi-stage Generative Adversarial Networks for text-to-image synthesis

Learning to Draw Text in Natural Images with Conditional Adversarial Networks

SAW-GAN: Multi-granularity Text Fusion Generative Adversarial Networks for text-to-image generation

CF-GAN: cross-domain feature fusion generative adversarial network for text-to-image synthesis

Text-to-Image Synthesis via Visual-Memory Creative Adversarial Network

Adaptive Forgetting, Drafting and Comprehensive Guiding: Text-to-Image Synthesis with Hierarchical Generative Adversarial Networks

DF-GAN: A Simple and Effective Baseline for Text-to-Image Synthesis