Abstract: Synthesizing vivid human portraits is an active research topic in image generation with a wide range of applications. Beyond fidelity, controllability is another key factor that has long limited progress. To address this issue, existing solutions usually condition face synthesis on either textual or visual inputs, e.g., descriptions or segmentation masks, but neither alone can fully control the generation because of the intrinsic limitations of each condition. In this paper, we propose to use both types of prior information for controllable face generation. In particular, the segmentation mask provides coarse-grained information about the face, such as face shape and pose, while the text description renders detailed face attributes, e.g., face color, makeup and gender. More importantly, we aim to make the generation easy to control by interactively editing both types of input, making face generation more practical for real-world applications. To this end, we propose a novel face generation model termed PixelFace+. In PixelFace+, both the text and the mask are encoded as pixel-wise priors, on which the pixel synthesis process is conditioned to produce the expected portraits. The loss objectives are also carefully designed to ensure that the generated faces are semantically aligned with both the text and the mask inputs. To validate PixelFace+, we conduct comprehensive experiments on the widely recognized MMCelebA benchmark. We quantitatively compare PixelFace+ with a range of recently proposed Text-to-Face (T2F) generation methods and also provide extensive qualitative analyses. The experimental results demonstrate that PixelFace+ not only outperforms existing generation methods in both image quality and conditional matching but also offers much better controllability of face generation. In addition, PixelFace+ provides a convenient and interactive way to generate and manipulate faces by editing the text and mask inputs. Our source code and demo are provided in the supplementary materials.
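The idea of encoding both text and mask as pixel-wise priors can be illustrated with a toy sketch. This is not the PixelFace+ architecture, which the abstract does not specify; it only assumes that each pixel receives a vector built from its mask label (one-hot) concatenated with a shared text vector, and that a single shared linear layer with a sigmoid maps that vector to RGB. The function names `build_pixel_priors` and `synthesize`, the two-element `text_vec`, and the random weights are all hypothetical stand-ins.

```python
import math
import random


def build_pixel_priors(mask, text_vec, num_classes):
    """Build one prior vector per pixel: one-hot mask label + text vector."""
    priors = []
    for row in mask:
        prior_row = []
        for label in row:
            one_hot = [1.0 if c == label else 0.0 for c in range(num_classes)]
            prior_row.append(one_hot + list(text_vec))
        priors.append(prior_row)
    return priors


def synthesize(priors, weights, bias):
    """Map every pixel prior to an RGB triple with one shared linear layer.

    Assumption: a real model would use a deep network here; a single
    linear layer followed by a sigmoid keeps the sketch short.
    """
    image = []
    for row in priors:
        out_row = []
        for prior in row:
            rgb = []
            for w_channel, b_channel in zip(weights, bias):
                z = sum(w * x for w, x in zip(w_channel, prior)) + b_channel
                rgb.append(1.0 / (1.0 + math.exp(-z)))
            out_row.append(tuple(rgb))
        image.append(out_row)
    return image


rng = random.Random(0)
mask = [[0, 0, 1], [0, 1, 1]]  # toy 2x3 segmentation mask with 2 classes
text_vec = [0.5, -0.2]         # hypothetical stand-in for a text-encoder output
dim = 2 + len(text_vec)        # one-hot size + text vector size
weights = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(3)]
bias = [0.0, 0.0, 0.0]

priors = build_pixel_priors(mask, text_vec, num_classes=2)
image = synthesize(priors, weights, bias)
```

In this sketch, editing `mask` changes which pixels share a prior, and therefore the spatial layout, while editing `text_vec` shifts the output of every pixel at once. That loosely mirrors the coarse (mask) versus detailed (text) split of control described in the abstract.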

SpaText: Spatio-Textual Representation for Controllable Image Generation—Supplementary Material

SpaText: Spatio-Textual Representation for Controllable Image Generation

Chasing Consistency in Text-to-3D Generation from a Single Image.

From External to Internal: Structuring Image for Text-to-Image Attributes Manipulation

Text-Guided Human Image Manipulation Via Image-Text Shared Space

Masked-Attention Diffusion Guidance for Spatially Controlling Text-to-Image Generation

Getting it Right: Improving Spatial Consistency in Text-to-Image Models

Multi-Tailed, Multi-Headed, Spatial Dynamic Memory refined Text-to-Image Synthesis

PreciseControl: Enhancing Text-To-Image Diffusion Models with Fine-Grained Attribute Control

Learning Continuous 3D Words for Text-to-Image Generation

FineControlNet: Fine-level Text Control for Image Generation with Spatially Aligned Text Control Injection

GenesisTex2: Stable, Consistent and High-Quality Text-to-Texture Generation

Imagen 3

PixelFace+: Towards Controllable Face Generation and Manipulation with Text Descriptions and Segmentation Masks

An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion

Break-A-Scene: Extracting Multiple Concepts from a Single Image.

End-to-End Text-to-Image Synthesis with Spatial Constrains

Saliency Detection of Textured 3D Models Based on Multi-View Information and Texel Descriptor

Supplementary Materials for AE TextSpotter

Text2Human: Text-Driven Controllable Human Image Generation.