From Text to Pose to Image: Improving Diffusion Model Control and Quality

Clément Bonnett, Ariel N. Lee, Franck Wertel, Antoine Tamano, Tanguy Cizain, Pablo Ducru
2024-11-20
Abstract: In the last two years, text-to-image diffusion models have become extremely popular. As their quality and usage increase, a major concern has been the need for better output control. In addition to prompt engineering, one effective method to improve the controllability of diffusion models has been to condition them on additional modalities such as image style, depth map, or keypoints. This forms the basis of ControlNets or Adapters. When attempting to apply these methods to control human poses in outputs of text-to-image diffusion models, two main challenges have arisen. The first challenge is generating poses following a wide range of semantic text descriptions, for which previous methods involved searching for a pose within a dataset of (caption, pose) pairs. The second challenge is conditioning image generation on a specified pose while keeping both high aesthetic and high pose fidelity. In this article, we fix these two main issues by introducing a text-to-pose (T2P) generative model alongside a new sampling algorithm, and a new pose adapter that incorporates more pose keypoints for higher pose fidelity. Together, these two new state-of-the-art models enable, for the first time, a generative text-to-pose-to-image framework for higher pose control in diffusion models. We release all models and the code used for the experiments at https://github.com/clement-bonnet/text-to-pose.
Computer Vision and Pattern Recognition, Artificial Intelligence, Machine Learning
What problem does this paper attempt to address?
This paper aims to improve the precision and quality of human pose control in text-to-image generation. It focuses on two main challenges:

1. **Generating human poses that match a wide range of semantic descriptions**: Previous solutions usually retrieve a suitable pose from a dataset of (caption, pose) pairs, which limits the diversity and adaptability of the poses.
2. **Generating high-quality images conditioned on a specified pose**: Previous methods handle facial and hand details poorly, which lowers pose fidelity, and the aesthetic quality of their images is lower than that of the original model.

To address these problems, the paper proposes the following:

- **Text-to-Pose Generation Model (T2P)**: A new autoregressive model that generates human poses from text descriptions. The model is trained through a Contrastive Learning for Pose Prediction (CLaPP) framework so that the generated poses closely match the text descriptions.
- **New pose adapter**: In addition to body poses, this adapter takes face and hand keypoints as input, which improves pose fidelity. Because it is trained on high-quality images, its outputs are also closer in aesthetic quality to those of the original model.

Combining these two models yields a new text-to-pose-to-image generation framework that improves both the precision of pose control and the quality of the generated images. The framework addresses the two challenges above and opens a research direction for future image generation work.
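The summary does not say which objective the contrastive framework uses. As a rough illustration only, the sketch below assumes a symmetric, CLIP-style InfoNCE loss over a batch of matched (text, pose) embedding pairs. The function names, the temperature value, and the random embeddings are all hypothetical and are not taken from the paper or its code release.

```python
import math
import random


def normalize(vec):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def log_softmax_at(scores, target):
    """Log-probability of index `target` under a softmax over `scores`."""
    top = max(scores)  # subtract the max for numerical stability
    log_norm = top + math.log(sum(math.exp(s - top) for s in scores))
    return scores[target] - log_norm


def symmetric_contrastive_loss(text_embs, pose_embs, temperature=0.07):
    """Assumed CLIP-style loss: row i of each input is a matched pair."""
    texts = [normalize(t) for t in text_embs]
    poses = [normalize(p) for p in pose_embs]
    n = len(texts)
    # Cosine similarity between every text and every pose, scaled by temperature.
    logits = [
        [sum(a * b for a, b in zip(t, p)) / temperature for p in poses]
        for t in texts
    ]
    # Text-to-pose direction: each text should pick out its own pose.
    text_to_pose = -sum(log_softmax_at(logits[i], i) for i in range(n)) / n
    # Pose-to-text direction: the same computation on the transposed matrix.
    columns = [list(col) for col in zip(*logits)]
    pose_to_text = -sum(log_softmax_at(columns[i], i) for i in range(n)) / n
    return 0.5 * (text_to_pose + pose_to_text)


# Toy check with made-up embeddings: matched pairs versus mismatched pairs.
random.seed(0)
texts = [[random.gauss(0, 1) for _ in range(8)] for _ in range(4)]
aligned_loss = symmetric_contrastive_loss(texts, texts)
shuffled_loss = symmetric_contrastive_loss(texts, texts[::-1])
```

In this toy setup the loss is lower when each text is paired with its own embedding than when the pairing is reversed, which is the behavior a contrastive text-pose objective is meant to reward. A score of this kind could also be used to rank candidate poses for a caption, although whether the paper does so is not stated here.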