PreciseControl: Enhancing Text-To-Image Diffusion Models with Fine-Grained Attribute Control

Rishubh Parihar, Sachidanand VS, Sabariswaran Mani, Tejan Karmali, R. Venkatesh Babu

2024-07-24

Abstract: Recently, we have seen a surge of personalization methods for text-to-image (T2I) diffusion models to learn a concept using a few images. Existing approaches, when used for face personalization, struggle to achieve convincing inversion with identity preservation and rely on semantic text-based editing of the generated face. However, more fine-grained control is desired for facial attribute editing, which is challenging to achieve solely with text prompts. In contrast, StyleGAN models learn a rich face prior and enable smooth, fine-grained attribute editing through latent manipulation. This work uses the disentangled $\mathcal{W+}$ space of StyleGANs to condition the T2I model. This approach allows us to precisely manipulate facial attributes, such as smoothly introducing a smile, while preserving the existing coarse text-based control inherent in T2I models. To enable conditioning of the T2I model on the $\mathcal{W+}$ space, we train a latent mapper to translate latent codes from $\mathcal{W+}$ to the token embedding space of the T2I model. The proposed approach excels in the precise inversion of face images with attribute preservation and facilitates continuous control for fine-grained attribute editing. Furthermore, our approach can be readily extended to generate compositions involving multiple individuals. We perform extensive experiments to validate our method for face personalization and fine-grained attribute editing.

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

This paper addresses precise control over facial attributes in text-to-image (T2I) generation models, particularly in personalized portrait generation. It targets three problems:

1. **Identity Preservation and Fine Attribute Editing**: Existing T2I diffusion models struggle to preserve identity convincingly during face personalization and rely on text-based semantic editing to modify generated faces. Text prompts alone make fine control over facial attributes (such as smiles or beards) difficult.
2. **Combining the Advantages of Two Models**: The paper combines T2I diffusion models with StyleGAN models, using the T2I model for coarse-grained text control and the W+ space of StyleGAN for fine-grained attribute control.
3. **Multiple Subject Generation**: The paper also explores how to combine multiple personalized subjects in a single scene while preserving each identity with high fidelity and avoiding attribute mixing between faces.

To address these problems, the paper proposes a framework named "PreciseControl," which achieves fine control over generated faces by conditioning the T2I model on the W+ space of StyleGAN. A latent mapper, trained for this purpose, translates W+ latent codes into the token embedding space of the T2I model. The paper also introduces a method to fuse multiple personalized models for multi-subject generation.

In summary, this research aims for more precise attribute control in personalized face generation by combining the strengths of T2I diffusion models and StyleGAN models, while maintaining identity consistency.
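The conditioning idea summarized above can be illustrated with a small sketch: shift a W+ latent code along an attribute direction, then map the edited code to a token embedding that the T2I text encoder could consume. Everything here is an assumption made for illustration, not the authors' implementation: the names `edit_latent` and `LinearMapper`, the toy dimensions, the random "smile" direction, and the single affine layer standing in for the trained latent mapper are all hypothetical.

```python
import random

# Toy sizes (assumptions chosen to keep the sketch fast; real W+ codes and
# token embeddings are much larger).
NUM_LAYERS, W_DIM, TOKEN_DIM = 4, 8, 6


def edit_latent(w_plus, direction, alpha):
    """Shift every layer of a W+ code along an attribute direction: w + alpha * d."""
    return [[w + alpha * d for w, d in zip(layer, direction)] for layer in w_plus]


class LinearMapper:
    """Hypothetical stand-in for the latent mapper: flatten W+ and apply one affine map."""

    def __init__(self, in_dim, out_dim, seed=0):
        rng = random.Random(seed)
        self.weight = [[rng.gauss(0.0, 0.02) for _ in range(in_dim)]
                       for _ in range(out_dim)]
        self.bias = [0.0] * out_dim

    def __call__(self, w_plus):
        flat = [v for layer in w_plus for v in layer]
        return [sum(wi * x for wi, x in zip(row, flat)) + b
                for row, b in zip(self.weight, self.bias)]


rng = random.Random(1)
# Random placeholders: a real pipeline would obtain w_plus from a StyleGAN
# encoder and the direction from a latent-editing method.
w_plus = [[rng.gauss(0.0, 1.0) for _ in range(W_DIM)] for _ in range(NUM_LAYERS)]
smile_direction = [rng.gauss(0.0, 1.0) for _ in range(W_DIM)]

mapper = LinearMapper(NUM_LAYERS * W_DIM, TOKEN_DIM)

# Sweep the edit strength; each resulting vector would stand in for the
# embedding of a placeholder token in the text prompt.
tokens = {alpha: mapper(edit_latent(w_plus, smile_direction, alpha))
          for alpha in (0.0, 0.5, 1.0)}
```

Because the edit strength `alpha` is a continuous scalar, the token embedding changes smoothly with it, which is the kind of gradual attribute control that a discrete text prompt cannot easily express.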

Attribute-specific Control Units in StyleGAN for Fine-grained Image Manipulation

DisControlFace: Disentangled Control for Personalized Facial Image Editing

Continuous, Subject-Specific Attribute Control in T2I Models by Identifying Semantic Directions

Towards Spatially Disentangled Manipulation of Face Images With Pre-Trained StyleGANs

Exploring Attribute Variations in Style-based GANs using Diffusion Models

FaceController: Controllable Attribute Editing for Face in the Wild

FacialGAN: Style Transfer and Attribute Manipulation on Synthetic Faces

A Latent Transformer for Disentangled Face Editing in Images and Videos

MagicTailor: Component-Controllable Personalization in Text-to-Image Diffusion Models

Towards a Simultaneous and Granular Identity-Expression Control in Personalized Face Generation

TransEditor: Transformer-Based Dual-Space GAN for Highly Controllable Facial Editing

Personalized Face Inpainting with Diffusion Models by Parallel Visual Attention

ID-Aligner: Enhancing Identity-Preserving Text-to-Image Generation with Reward Feedback Learning

Controllable 3D Face Generation with Conditional Style Code Diffusion

PIE: Portrait Image Embedding for Semantic Control

Video2StyleGAN: Disentangling Local and Global Variations in a Video

Which Style Makes Me Attractive? Interpretable Control Discovery and Counterfactual Explanation on StyleGAN

Direct Consistency Optimization for Robust Customization of Text-to-Image Diffusion Models

Controllable and Identity-Aware Facial Attribute Transformation

Towards Arbitrary Text-driven Image Manipulation Via Space Alignment