Stable-Pose: Leveraging Transformers for Pose-Guided Text-to-Image Generation

Jiajun Wang, Morteza Ghahremani, Yitong Li, Björn Ommer, Christian Wachinger
2024-06-05
Abstract: Controllable text-to-image (T2I) diffusion models have shown impressive performance in generating high-quality visual content through the incorporation of various conditions. Current methods, however, exhibit limited performance when guided by skeleton human poses, especially in complex pose conditions such as side or rear perspectives of human figures. To address this issue, we present Stable-Pose, a novel adapter model that introduces a coarse-to-fine attention masking strategy into a vision Transformer (ViT) to gain accurate pose guidance for T2I models. Stable-Pose is designed to adeptly handle pose conditions within pre-trained Stable Diffusion, providing a refined and efficient way of aligning pose representation during image synthesis. We leverage the query-key self-attention mechanism of ViTs to explore the interconnections among different anatomical parts in human pose skeletons. Masked pose images are used to smoothly refine the attention maps based on target pose-related features in a hierarchical manner, transitioning from coarse to fine levels. Additionally, our loss function is formulated to allocate increased emphasis to the pose region, thereby augmenting the model's precision in capturing intricate pose details. We assessed the performance of Stable-Pose across five public datasets under a wide range of indoor and outdoor human pose scenarios. Stable-Pose achieved an AP score of 57.1 on the LAION-Human dataset, an improvement of around 13% over the established technique ControlNet. The project page and code are available at https://github.com/ai-med/StablePose.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses the limited accuracy of text-to-image generation when skeleton poses are used as the guiding condition. Current methods perform poorly under complex pose conditions (such as side-view or rear-view human poses) and struggle in particular to maintain human body proportions and pose details. The paper therefore proposes a new model, Stable-Pose, which introduces a coarse-to-fine attention masking strategy into a Vision Transformer (ViT) to improve pose-guided text-to-image generation.

### Main problems

1. **Generation quality under complex pose conditions**: Existing methods produce low-quality images for complex poses (such as side-view or rear-view human poses) and have difficulty capturing pose details accurately.
2. **Maintenance of human body proportions**: Existing methods may fail to preserve accurate body proportions, so the generated images look unnatural.
3. **Precision of pose guidance**: Existing methods follow the pose condition imprecisely, especially when the pose data is sparse.

### Solutions

Stable-Pose addresses these problems as follows:

- **Coarse-to-fine attention masking strategy**: Within the ViT, masked pose images progressively refine the attention toward the pose region, moving from coarse to fine levels and improving the precision of pose guidance.
- **Pose-mask-guided loss function**: The loss function places increased emphasis on the pose region, which improves the model's ability to capture pose details.
- **Efficient integration into pre-trained models**: Stable-Pose is an adapter that integrates into the pre-trained Stable Diffusion model, providing a lightweight way to improve pose control.
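
The two ideas above, coarse-to-fine pose masks and a loss that emphasizes the pose region, can be illustrated with a small sketch. It is not the authors' implementation: the square-window dilation, the radii `(2, 1, 0)`, the weight `alpha`, and the function names `dilate`, `coarse_to_fine_masks`, and `pose_weighted_mse` are all assumptions chosen for illustration, and the paper's actual masking and loss formulation may differ.

```python
from typing import List, Sequence

Mask = List[List[int]]
Grid = List[List[float]]


def dilate(mask: Mask, radius: int) -> Mask:
    """Binary dilation with a square window of side 2*radius + 1 (assumed shape)."""
    h, w = len(mask), len(mask[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if mask[y][x]:
                # Mark every pixel within `radius` of an active skeleton pixel.
                for yy in range(max(0, y - radius), min(h, y + radius + 1)):
                    for xx in range(max(0, x - radius), min(w, x + radius + 1)):
                        out[yy][xx] = 1
    return out


def coarse_to_fine_masks(skeleton: Mask, radii: Sequence[int] = (2, 1, 0)) -> List[Mask]:
    """Return masks ordered from coarse (wide region) to fine (skeleton only)."""
    return [dilate(skeleton, r) for r in radii]


def pose_weighted_mse(pred: Grid, target: Grid, mask: Mask, alpha: float = 5.0) -> float:
    """Mean squared error where pixels inside the pose mask weigh (1 + alpha).

    `alpha` is a hypothetical emphasis factor; alpha = 0 gives the plain MSE.
    """
    total, count = 0.0, 0
    for row_p, row_t, row_m in zip(pred, target, mask):
        for p, t, m in zip(row_p, row_t, row_m):
            total += (1.0 + alpha * m) * (p - t) ** 2
            count += 1
    return total / count


if __name__ == "__main__":
    # Toy 5x5 "skeleton" with a single active pixel in the centre.
    skeleton = [[0] * 5 for _ in range(5)]
    skeleton[2][2] = 1
    for mask in coarse_to_fine_masks(skeleton):
        print(sum(map(sum, mask)))  # active pixels per level: 25, then 9, then 1
```

In this sketch, successive levels would restrict attention to ever tighter regions around the skeleton, while the weighted loss penalizes errors inside the pose mask more heavily than errors in the background.
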
### Experimental results

The paper reports experiments on five public datasets: Human-Art, LAION-Human, UBC Fashion, DanceTrack, and DAVIS. The results show that Stable-Pose improves pose accuracy, image quality, and text-image alignment, especially for complex poses and poses seen from different perspectives.

### Contributions

- **High-precision pose guidance**: Stable-Pose captures complex pose details accurately while generating high-quality images, and maintains this precision under challenging conditions.
- **Innovative attention mechanism**: The coarse-to-fine attention masking strategy lets Stable-Pose model the relationships between the different parts of the human pose more effectively.
- **Wide applicability**: Stable-Pose performs well across multiple datasets and application scenarios, demonstrating its applicability and robustness in pose-guided text-to-image generation.

In conclusion, Stable-Pose improves pose-guided text-to-image generation by combining a coarse-to-fine attention masking mechanism with a pose-aware loss function, and it addresses the shortcomings of existing methods on complex poses.