PreciseControl: Enhancing Text-To-Image Diffusion Models with Fine-Grained Attribute Control

Rishubh Parihar, Sachidanand VS, Sabariswaran Mani, Tejan Karmali, R. Venkatesh Babu
2024-07-24
Abstract: Recently, we have seen a surge of personalization methods for text-to-image (T2I) diffusion models that learn a concept from a few images. When used for face personalization, existing approaches struggle to achieve convincing inversion with identity preservation, and they rely on semantic text-based editing of the generated face. However, facial attribute editing calls for more fine-grained control, which is challenging to achieve with text prompts alone. In contrast, StyleGAN models learn a rich face prior and enable smooth, fine-grained attribute editing through latent manipulation. This work uses the disentangled $\mathcal{W+}$ space of StyleGANs to condition the T2I model. This approach allows us to precisely manipulate facial attributes, such as smoothly introducing a smile, while preserving the coarse text-based control inherent in T2I models. To enable conditioning of the T2I model on the $\mathcal{W+}$ space, we train a latent mapper to translate latent codes from $\mathcal{W+}$ to the token embedding space of the T2I model. The proposed approach excels at precise inversion of face images with attribute preservation and facilitates continuous control for fine-grained attribute editing. Furthermore, our approach can be readily extended to generate compositions involving multiple individuals. We perform extensive experiments to validate our method for face personalization and fine-grained attribute editing.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses precise control over facial attributes in Text-to-Image (T2I) generation models, particularly in personalized portrait generation. Specifically, it targets the following problems:

1. **Identity Preservation and Fine Attribute Editing**: Existing T2I diffusion models struggle to preserve identity convincingly during face personalization, and they rely on text-based semantic editing to modify the generated face. Fine control over facial attributes (such as a smile or a beard) is difficult to achieve through text prompts alone.
2. **Combining the Advantages of Two Models**: The paper proposes a method that combines T2I diffusion models and StyleGAN models, using the T2I model for coarse-grained text control and the $\mathcal{W+}$ space of StyleGAN for fine-grained attribute control.
3. **Multiple Subject Generation**: The paper also explores how to combine multiple personalized subjects in a single scene while preserving each identity with high fidelity and avoiding attribute mixing between faces.

To address these problems, the paper proposes a framework named "PreciseControl," which achieves fine control over generated faces by conditioning the T2I model on the $\mathcal{W+}$ space of StyleGANs. A latent mapper translates $\mathcal{W+}$ latent codes into the token embedding space of the T2I model. The paper also introduces a method to fuse multiple personalized models for multi-subject generation.

In summary, this research aims to achieve more precise attribute control in personalized face generation by combining the strengths of T2I diffusion models and StyleGAN models, while maintaining identity consistency.
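
The conditioning idea described above, a latent mapper from StyleGAN's W+ space to the T2I token embedding space plus attribute editing by moving a latent code along a direction, can be illustrated with a small sketch. This is not the authors' implementation. The W+ shape (18 x 512), the token width (768, as in the text encoder of Stable Diffusion v1.x), the two-layer MLP, the function names, and the random "smile" direction are all assumptions made for illustration. The weights are random, so the outputs carry no meaning beyond their shapes.

```python
import numpy as np

# Assumed dimensions (not taken from the paper):
N_LAYERS, W_DIM = 18, 512   # a common W+ shape for StyleGAN2 at 1024x1024
TOKEN_DIM = 768             # token embedding width, e.g. Stable Diffusion v1.x
HIDDEN = 256                # hypothetical hidden width of the mapper


def init_mapper(rng):
    """Random weights for a hypothetical two-layer MLP latent mapper."""
    in_dim = N_LAYERS * W_DIM
    return {
        "w1": rng.normal(0.0, in_dim ** -0.5, (in_dim, HIDDEN)),
        "b1": np.zeros(HIDDEN),
        "w2": rng.normal(0.0, HIDDEN ** -0.5, (HIDDEN, TOKEN_DIM)),
        "b2": np.zeros(TOKEN_DIM),
    }


def map_to_token(params, w_plus):
    """Map one W+ code of shape (N_LAYERS, W_DIM) to one token embedding."""
    x = w_plus.reshape(-1)
    h = np.maximum(x @ params["w1"] + params["b1"], 0.0)  # ReLU
    return h @ params["w2"] + params["b2"]


def edit_latent(w_plus, direction, strength):
    """Linear latent edit: move the code along an attribute direction."""
    return w_plus + strength * direction


rng = np.random.default_rng(0)
params = init_mapper(rng)

# Stand-ins for an inverted face code and a learned "smile" direction.
w_plus = rng.normal(size=(N_LAYERS, W_DIM))
smile_dir = rng.normal(size=(N_LAYERS, W_DIM))
smile_dir /= np.linalg.norm(smile_dir)

# A continuous sweep of edit strengths yields a sequence of token embeddings
# that would replace a placeholder token in the text prompt.
tokens = [
    map_to_token(params, edit_latent(w_plus, smile_dir, s))
    for s in (0.0, 1.0, 2.0)
]
print(tokens[0].shape)
```

In this sketch the edit strength is a continuous scalar, which is what distinguishes latent-space editing from discrete text edits such as adding the word "smiling" to the prompt.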