Abstract: Existing text-guided image synthesis methods produce results of limited quality, at a resolution of at most $256^2$, and their textual instructions are constrained to a small corpus. In this work, we propose a unified framework for both face image generation and manipulation that produces diverse, high-quality images at an unprecedented resolution of 1024 from multi-modal inputs. More importantly, our method supports open-world scenarios, covering both images and text, without any re-training, fine-tuning, or post-processing. Specifically, we propose a new paradigm of text-guided image generation and manipulation that builds on the superior characteristics of a pretrained GAN model. The paradigm includes two novel strategies. The first is to train a text encoder to obtain latent codes that align with the hierarchical semantics of the pretrained GAN model. The second is to directly optimize the latent codes in the latent space of the pretrained GAN model with guidance from a pretrained language model. The latent codes can be randomly sampled from a prior distribution or inverted from a given image, which provides inherent support for both image generation and manipulation from multi-modal inputs, such as sketches or semantic labels, with textual guidance. To facilitate text-guided multi-modal synthesis, we propose Multi-Modal CelebA-HQ, a large-scale dataset consisting of real face images and corresponding semantic segmentation maps, sketches, and textual descriptions. Extensive experiments on the introduced dataset demonstrate the superior performance of our proposed method. Code and data are available at https://github.com/weihaox/TediGAN.
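The second strategy, optimizing a latent code directly under text guidance, can be illustrated with a minimal sketch. This is not the authors' implementation: the linear maps `A` and `B` are toy stand-ins (assumptions) for a pretrained generator and a pretrained image encoder, `t` stands in for a text embedding, and the function names, the regularization weight `lam`, and the step sizes are all hypothetical. The sketch only shows the general shape of the loop: start from a sampled or inverted code `w0`, then move it to increase the cosine similarity between the encoded output and the text embedding while staying close to `w0`.

```python
import numpy as np

def cosine_loss_and_grad(w, M, t, w0, lam):
    """Loss = 1 - cos(M w, t) + lam * ||w - w0||^2, with its analytic gradient."""
    f = M @ w                                   # toy "encoded generated image"
    nf, nt = np.linalg.norm(f), np.linalg.norm(t)
    cos = f @ t / (nf * nt)
    dcos_df = t / (nf * nt) - cos * f / nf**2   # gradient of cosine w.r.t. f
    loss = 1.0 - cos + lam * np.sum((w - w0) ** 2)
    grad = -M.T @ dcos_df + 2.0 * lam * (w - w0)
    return loss, grad

def optimize_latent(w0, M, t, lam=0.01, lr=0.5, steps=100):
    """Gradient descent with step halving; only loss-decreasing steps are accepted."""
    w = w0.copy()
    loss, grad = cosine_loss_and_grad(w, M, t, w0, lam)
    history = [loss]
    for _ in range(steps):
        step = lr
        while step > 1e-8:
            cand = w - step * grad
            cand_loss, cand_grad = cosine_loss_and_grad(cand, M, t, w0, lam)
            if cand_loss < loss:
                w, loss, grad = cand, cand_loss, cand_grad
                break
            step *= 0.5                         # shrink the step and retry
        history.append(loss)
    return w, history

rng = np.random.default_rng(0)
A = rng.normal(size=(32, 8))    # hypothetical stand-in for a pretrained generator
B = rng.normal(size=(16, 32))   # hypothetical stand-in for an image encoder
M = B @ A                       # composed map from latent code to embedding
t = rng.normal(size=16)         # hypothetical stand-in for a text embedding
w0 = rng.normal(size=8)         # latent code: sampled here, could be inverted from an image
w_opt, history = optimize_latent(w0, M, t)
```

In a real system the generator and the similarity model would be deep networks and the gradient would come from automatic differentiation; the term weighted by `lam` plays the role of keeping the edited result close to the starting code, which matters when `w0` was inverted from a given image.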

End-to-End Text-to-Image Synthesis with Spatial Constrains

Fine-grained Semantic Constraint in Image Synthesis

Specific Diverse Text-to-Image Synthesis Via Exemplar Guidance

Object-driven Text-to-Image Synthesis via Adversarial Training

Layout-Bridging Text-to-Image Synthesis

Multi-Tailed, Multi-Headed, Spatial Dynamic Memory refined Text-to-Image Synthesis

Text-to-image synthesis: Starting composite from the foreground content

R-GAN: Exploring Human-like Way for Reasonable Text-to-Image Synthesis via Generative Adversarial Networks

Training-free Composite Scene Generation for Layout-to-Image Synthesis

Perceptual Pyramid Adversarial Networks for Text-to-Image Synthesis

Scene Text Synthesis for Efficient and Effective Deep Network Training

Verisimilar Image Synthesis for Accurate Detection and Recognition of Texts in Scenes

Text-to-Image Synthesis via Visual-Memory Creative Adversarial Network

LoCo: Locally Constrained Training-Free Layout-to-Image Synthesis

Text Pared into Scene Graph for Diverse Image Generation

TCGIS: Text and Contour Guided Controllable Image Synthesis

Harnessing the Spatial-Temporal Attention of Diffusion Models for High-Fidelity Text-to-Image Synthesis

Label-free Neural Semantic Image Synthesis

Taming Encoder for Zero Fine-tuning Image Customization with Text-to-Image Diffusion Models

Towards Open-World Text-Guided Face Image Generation and Manipulation

Improving Text Generation on Images with Synthetic Captions