Abstract: Existing text-guided image synthesis methods produce results of limited quality, with a resolution of at most $256^2$, and their textual instructions are constrained to a small corpus. In this work, we propose a unified framework for both face image generation and manipulation that produces diverse, high-quality images at an unprecedented resolution of 1024 from multimodal inputs. More importantly, our method supports open-world scenarios, including both image and text, without any re-training, fine-tuning, or post-processing. Specifically, we propose a new paradigm of text-guided image generation and manipulation that builds on the superior characteristics of a pretrained GAN model. The paradigm includes two novel strategies. The first is to train a text encoder to obtain latent codes that align with the hierarchical semantics of the pretrained GAN model. The second is to directly optimize the latent codes in the latent space of the pretrained GAN model with guidance from a pretrained language model. The latent codes can be randomly sampled from a prior distribution or inverted from a given image, which provides inherent support for both image generation and manipulation from multi-modal inputs, such as sketches or semantic labels, with textual guidance. To facilitate text-guided multi-modal synthesis, we propose Multi-Modal CelebA-HQ, a large-scale dataset consisting of real face images and their corresponding semantic segmentation maps, sketches, and textual descriptions. Extensive experiments on the introduced dataset demonstrate the superior performance of our proposed method. Code and data are available at https://github.com/weihaox/TediGAN.
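The second strategy, optimizing a latent code under the guidance of a frozen pretrained model, can be illustrated with a toy sketch. This is not the authors' implementation. The matrix `M` is a hypothetical linear stand-in for the frozen generator followed by a frozen image encoder, `t` stands in for a text embedding, and the proximity term weighted by `reg` (which keeps the code near its starting point, as when editing an inverted image) is an assumption made for illustration.

```python
import math

def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def mat_t_vec(M, v):
    """Multiply the transpose of M by vector v."""
    return [sum(M[i][j] * v[i] for i in range(len(M))) for j in range(len(M[0]))]

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def cosine(a, b):
    return sum(x * y for x, y in zip(a, b)) / (norm(a) * norm(b))

def cosine_grad(f, t):
    """Gradient of cosine(f, t) with respect to f."""
    nf, nt = norm(f), norm(t)
    dot = sum(x * y for x, y in zip(f, t))
    return [ti / (nf * nt) - dot * fi / (nf ** 3 * nt) for fi, ti in zip(f, t)]

def optimize_latent(M, t, w0, steps=200, lr=0.1, reg=0.01):
    """Gradient descent on (1 - cos(M w, t)) + reg * ||w - w0||^2.

    Only the latent code w is updated; M (the frozen stand-in model) is not.
    """
    w = list(w0)
    for _ in range(steps):
        f = matvec(M, w)                          # toy "generate and encode"
        g_sim = mat_t_vec(M, cosine_grad(f, t))   # chain rule back to w
        grad = [-gs + 2.0 * reg * (wi - w0i) for gs, wi, w0i in zip(g_sim, w, w0)]
        w = [wi - lr * gi for wi, gi in zip(w, grad)]
    return w

# Toy setup (all values hypothetical): 2-D latent code, 3-D embedding space.
M = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w0 = [1.0, 0.0]                    # sampled or inverted starting code
t = matvec(M, [0.0, 1.0])          # "text" embedding reachable by the model

before = cosine(matvec(M, w0), t)
w_opt = optimize_latent(M, t, w0)
after = cosine(matvec(M, w_opt), t)
print(f"similarity before: {before:.3f}, after: {after:.3f}")
```

In a real system the linear map would be replaced by a pretrained generator and a pretrained vision-language encoder, with gradients obtained by automatic differentiation. The structure of the loop, where the models stay frozen and only the latent code changes, is the point of the sketch.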

Prompt-Based Modality Bridging for Unified Text-to-Face Generation and Manipulation

Multimodal-driven Talking Face Generation, Face Swapping, Diffusion Model

User-Friendly Customized Generation with Multi-Modal Prompts

Towards Open-World Text-Guided Face Image Generation and Manipulation

PromptMagician: Interactive Prompt Engineering for Text-to-Image Creation

AI Illustrator: Translating Raw Descriptions into Images by Prompt-based Cross-Modal Generation

One Model to Edit Them All: Free-Form Text-Driven Image Manipulation with Semantic Modulations

Multimodal-driven Talking Face Generation via a Unified Diffusion-based Generator

CLIP-PAE: Projection-Augmentation Embedding to Extract Relevant Features for a Disentangled, Interpretable, and Controllable Text-Guided Face Manipulation

TextCLIP: Text-Guided Face Image Generation And Manipulation Without Adversarial Training

PromptCharm: Text-to-Image Generation through Multi-modal Prompting and Refinement

Prompt-Softbox-Prompt: A free-text Embedding Control for Image Editing

EMMA: Your Text-to-Image Diffusion Model Can Secretly Accept Multi-Modal Prompts

MM2Latent: Text-to-facial image generation and editing in GANs with multimodal assistance

Promptify: Text-to-Image Generation through Interactive Prompt Exploration with Large Language Models

Multi-modal Generation via Cross-Modal In-Context Learning

A User-Friendly Framework for Generating Model-Preferred Prompts in Text-to-Image Synthesis

Capability-aware Prompt Reformulation Learning for Text-to-Image Generation

Promoting Unified Generative Framework with Descriptive Prompts for Joint Multi-Intent Detection and Slot Filling

Deeply Coupled Cross-Modal Prompt Learning

Multi-Prompt with Depth Partitioned Cross-Modal Learning