From Text to Pose to Image: Improving Diffusion Model Control and Quality

Clément Bonnett, Ariel N. Lee, Franck Wertel, Antoine Tamano, Tanguy Cizain, Pablo Ducru

2024-11-20

Abstract: In the last two years, text-to-image diffusion models have become extremely popular. As their quality and usage increase, a major concern has been the need for better output control. In addition to prompt engineering, one effective method to improve the controllability of diffusion models has been to condition them on additional modalities such as image style, depth map, or keypoints. This forms the basis of ControlNets or Adapters. When attempting to apply these methods to control human poses in outputs of text-to-image diffusion models, two main challenges have arisen. The first challenge is generating poses following a wide range of semantic text descriptions, for which previous methods involved searching for a pose within a dataset of (caption, pose) pairs. The second challenge is conditioning image generation on a specified pose while keeping both high aesthetic and high pose fidelity. In this article, we fix these two main issues by introducing a text-to-pose (T2P) generative model alongside a new sampling algorithm, and a new pose adapter that incorporates more pose keypoints for higher pose fidelity. Together, these two new state-of-the-art models enable, for the first time, a generative text-to-pose-to-image framework for higher pose control in diffusion models. We release all models and the code used for the experiments at https://github.com/clement-bonnet/text-to-pose.

Computer Vision and Pattern Recognition, Artificial Intelligence, Machine Learning

What problem does this paper attempt to address?

This paper aims to improve the precision and quality of human pose control in text-to-image generation. It focuses on two main challenges:

1. **Generating human poses that match a wide range of semantic descriptions**: Previous solutions search for a suitable pose in a dataset of (caption, pose) pairs, which limits the diversity and adaptability of the poses.
2. **Generating high-quality images conditioned on a specified pose**: Previous methods render facial and hand details poorly, which lowers pose fidelity, and their images are also of lower aesthetic quality than those of the original model.

To address these problems, the paper proposes two contributions:

- **Text-to-Pose generation model (T2P)**: A new autoregressive model that generates human poses from text descriptions. The model is trained through a Contrastive Learning for Pose Prediction (CLaPP) framework so that the generated poses closely match the text descriptions.
- **New pose adapter**: This adapter conditions on face and hand keypoints in addition to body pose, which improves pose fidelity. Because it is trained on high-quality images, the images it generates are also closer in aesthetic quality to those of the original model.

Combined, the two models form a text-to-pose-to-image generation framework that improves both the precision of pose control and the quality of the generated images. The framework addresses both challenges above and opens a research direction for future image generation work.
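To make the idea of contrastive text-pose alignment concrete, here is a minimal sketch of a CLIP-style symmetric contrastive loss between text embeddings and pose embeddings. It is not the authors' implementation: the function names, the temperature value of 0.07, the embedding sizes, and the random stand-in embeddings are all assumptions made for illustration, and the paper's actual encoders and training setup may differ.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    # Scale each row to unit length so dot products become cosine similarities.
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)

def log_softmax(logits, axis):
    # Numerically stable log-softmax along the given axis.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def contrastive_loss(text_emb, pose_emb, temperature=0.07):
    # Hypothetical helper: row i of text_emb is assumed to describe row i of pose_emb.
    t = l2_normalize(text_emb)
    p = l2_normalize(pose_emb)
    logits = t @ p.T / temperature            # (batch, batch) similarity matrix
    idx = np.arange(logits.shape[0])
    # Text-to-pose: each caption should pick out its own pose among the batch.
    loss_t2p = -log_softmax(logits, axis=1)[idx, idx].mean()
    # Pose-to-text: each pose should pick out its own caption among the batch.
    loss_p2t = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_t2p + loss_p2t)

# Random stand-ins for encoder outputs (assumed shapes: batch of 4, dimension 8).
rng = np.random.default_rng(0)
text = rng.normal(size=(4, 8))
matched = contrastive_loss(text, text + 0.01 * rng.normal(size=(4, 8)))
shuffled = contrastive_loss(text, text[::-1].copy())
print(f"matched pairs: {matched:.4f}, mismatched pairs: {shuffled:.4f}")
```

The loss is low when each caption embedding is most similar to its own pose embedding and high when the pairs are misaligned, which is the property a contrastive text-pose objective relies on.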


Unlocking the Potential of Text-to-Image Diffusion with PAC-Bayesian Theory

Stable-Pose: Leveraging Transformers for Pose-Guided Text-to-Image Generation

Controlling Human Shape and Pose in Text-to-Image Diffusion Models via Domain Adaptation

ECNet: Effective Controllable Text-to-Image Diffusion Models

eDiff-I: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers

Text-image Alignment for Diffusion-based Perception

DiffBlender: Scalable and Composable Multimodal Text-to-Image Diffusion Models

TextCraftor: Your Text Encoder Can be Image Quality Controller

If at First You Don't Succeed, Try, Try Again: Faithful Diffusion-based Text-to-Image Generation by Selection

UDiffText: A Unified Framework for High-quality Text Synthesis in Arbitrary Images via Character-aware Diffusion Models

Control-A-Video: Controllable Text-to-Video Generation with Diffusion Models

Sketch-Guided Text-to-Image Diffusion Models

Implementing and Experimenting with Diffusion Models for Text-to-Image Generation

Uni-ControlNet: All-in-One Control to Text-to-Image Diffusion Models

T2I-Adapter: Learning Adapters to Dig out More Controllable Ability for Text-to-Image Diffusion Models

Test-time Conditional Text-to-Image Synthesis Using Diffusion Models

ControlNet-XS: Designing an Efficient and Effective Architecture for Controlling Text-to-Image Diffusion Models

SpaText: Spatio-Textual Representation for Controllable Image Generation

Controlled and Conditional Text to Image Generation with Diffusion Prior

Towards Real-time Text-driven Image Manipulation with Unconditional Diffusion Models