Abstract: Existing text-guided image synthesis methods produce results of limited quality, with at most $256^2$ resolution, and their textual instructions are restricted to a small corpus. In this work, we propose a unified framework for both face image generation and manipulation that produces diverse, high-quality images at an unprecedented resolution of 1024 from multimodal inputs. More importantly, our method supports open-world scenarios, including both image and text, without any re-training, fine-tuning, or post-processing. Specifically, we propose a new paradigm of text-guided image generation and manipulation that builds on the superior characteristics of a pretrained GAN model. The paradigm includes two novel strategies. The first is to train a text encoder to obtain latent codes that align with the hierarchical semantics of the pretrained GAN model. The second is to directly optimize the latent codes in the latent space of the pretrained GAN model with guidance from a pretrained language model. The latent codes can be randomly sampled from a prior distribution or inverted from a given image, which provides inherent support for both image generation and manipulation from multimodal inputs, such as sketches or semantic labels, with textual guidance. To facilitate text-guided multimodal synthesis, we propose Multi-Modal CelebA-HQ, a large-scale dataset consisting of real face images and corresponding semantic segmentation maps, sketches, and textual descriptions. Extensive experiments on the introduced dataset demonstrate the superior performance of our proposed method. Code and data are available at https://github.com/weihaox/TediGAN.
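The second strategy, optimizing a latent code against a text-derived target while the pretrained models stay frozen, can be illustrated with a toy sketch. This is not the authors' implementation. It assumes a hypothetical frozen linear map `M` standing in for the pretrained generator composed with an image encoder, a hypothetical target embedding `t` standing in for the encoded text, a starting latent `w0` (sampled from a prior or obtained by inversion), and a hypothetical weight `lam` that keeps the result close to `w0`. Only the latent code is updated.

```python
import random

def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def transpose(M):
    """Return the transpose of matrix M."""
    return [list(col) for col in zip(*M)]

def loss(M, w, t, w0, lam):
    """Text-matching term plus a penalty for drifting away from w0."""
    residual = [a - b for a, b in zip(matvec(M, w), t)]
    drift = [a - b for a, b in zip(w, w0)]
    return sum(r * r for r in residual) + lam * sum(d * d for d in drift)

def optimize_latent(M, t, w0, lam=0.1, lr=0.05, steps=200):
    """Gradient descent on the latent code only; M and t stay frozen."""
    w = list(w0)
    Mt = transpose(M)
    for _ in range(steps):
        residual = [a - b for a, b in zip(matvec(M, w), t)]
        grad_match = matvec(Mt, residual)  # half the gradient of the matching term
        w = [wi - lr * (2 * g + 2 * lam * (wi - w0i))
             for wi, g, w0i in zip(w, grad_match, w0)]
    return w

# Toy stand-ins (assumptions): 4-dim latent, 3-dim shared embedding space.
rng = random.Random(0)
M = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(3)]
t = [rng.uniform(-1, 1) for _ in range(3)]
w0 = [rng.gauss(0, 1) for _ in range(4)]  # "sampled from a prior"
w_opt = optimize_latent(M, t, w0)
```

In this toy setting, starting from a randomly sampled `w0` corresponds to generation and starting from an inverted code corresponds to manipulation; the `lam` term is one simple way to keep an edited result near its starting point.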
