Abstract: Controllable text-to-image (T2I) diffusion models have shown impressive performance in generating high-quality visual content through the incorporation of various conditions. Current methods, however, exhibit limited performance when guided by skeleton human poses, especially in complex pose conditions such as side or rear perspectives of human figures. To address this issue, we present Stable-Pose, a novel adapter model that introduces a coarse-to-fine attention masking strategy into a vision Transformer (ViT) to gain accurate pose guidance for T2I models. Stable-Pose is designed to adeptly handle pose conditions within pre-trained Stable Diffusion, providing a refined and efficient way of aligning pose representation during image synthesis. We leverage the query-key self-attention mechanism of ViTs to explore the interconnections among different anatomical parts in human pose skeletons. Masked pose images are used to smoothly refine the attention maps based on target pose-related features in a hierarchical manner, transitioning from coarse to fine levels. Additionally, our loss function is formulated to allocate increased emphasis to the pose region, thereby augmenting the model's precision in capturing intricate pose details. We assessed the performance of Stable-Pose across five public datasets under a wide range of indoor and outdoor human pose scenarios. Stable-Pose achieved an AP score of 57.1 on the LAION-Human dataset, an improvement of around 13% over the established technique ControlNet. The project page and code are available at https://github.com/ai-med/StablePose.
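The coarse-to-fine masking described in the abstract can be illustrated with a small NumPy sketch. It is not the authors' implementation: the abstract does not say how the masked pose images are built, so the sketch assumes a binary skeleton mask that is dilated with a shrinking radius, one mask per ViT block, and assumes that keys outside the mask are simply excluded from self-attention. The function names, the radii, and the patch size are all hypothetical.

```python
import numpy as np


def dilate(mask: np.ndarray, radius: int) -> np.ndarray:
    """Binary dilation with a (2*radius+1)-wide square window (assumed kernel shape)."""
    h, w = mask.shape
    padded = np.pad(mask, radius)  # constant padding, i.e. False outside the image
    out = np.zeros_like(mask)
    for dy in range(2 * radius + 1):
        for dx in range(2 * radius + 1):
            out |= padded[dy:dy + h, dx:dx + w]
    return out


def coarse_to_fine_masks(pose_mask, radii=(4, 2, 1)):
    """One mask per block: largest radius first (coarse), smallest last (fine)."""
    pose_mask = np.asarray(pose_mask, dtype=bool)
    return [dilate(pose_mask, r) for r in radii]


def to_token_mask(pixel_mask: np.ndarray, patch: int) -> np.ndarray:
    """A patch token counts as pose-related if any pixel inside the patch is masked."""
    h, w = pixel_mask.shape
    blocks = pixel_mask.reshape(h // patch, patch, w // patch, patch)
    return blocks.any(axis=(1, 3)).reshape(-1)


def masked_self_attention(q, k, v, token_mask):
    """Scaled dot-product attention in which keys outside token_mask get zero weight.

    Assumes at least one token is inside the mask.
    """
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(token_mask[None, :], scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights
```

In use, each successive block would receive the next, tighter mask from `coarse_to_fine_masks`, so attention is progressively concentrated on tokens near the skeleton.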

What problem does this paper attempt to address?

This paper addresses the limited accuracy of text-to-image generation when skeletal poses are used as the guiding condition. Specifically, current methods perform poorly under complex pose conditions (such as side-view or rear-view human poses), and struggle in particular to maintain human body proportions and pose details. To this end, the paper proposes a new model named Stable-Pose. By introducing a Vision Transformer (ViT) with a coarse-to-fine attention mask strategy, it improves the performance of pose-guided text-to-image generation models.

### Main problems

1. **Generation quality under complex pose conditions**: Existing methods generate low-quality images for complex poses (such as side-view or rear-view human poses) and have particular difficulty capturing pose details accurately.
2. **Maintenance of human body proportions**: Existing methods may fail to maintain accurate human body proportions, so the generated images look unnatural.
3. **Precision of pose guidance**: Existing methods offer insufficient precision in pose guidance, especially when dealing with sparse pose data.

### Solutions

To address the above problems, Stable-Pose proposes the following solutions:

- **Coarse-to-fine attention mask strategy**: By introducing a coarse-to-fine attention mask strategy in the ViT, it gradually refines the attention to pose areas, thereby improving the precision of pose guidance.
- **Pose-mask-guided loss function**: A new loss function is designed that increases the emphasis on pose areas and further improves the model's ability to capture pose details.
- **Efficient integration into pre-trained models**: Stable-Pose can be efficiently integrated into the pre-trained Stable Diffusion model, providing a lightweight method for improving pose control.
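The idea of a loss that places more emphasis on pose areas can be sketched as a weighted mean squared error on the noise prediction. This is an illustrative form only: the summary does not give the actual formula, so the weighting `1 + alpha * mask`, the parameter `alpha`, and the function name are assumptions rather than the paper's definition.

```python
import numpy as np


def pose_weighted_mse(noise, noise_pred, pose_mask, alpha=5.0):
    """Mean squared error in which pixels inside the pose mask count (1 + alpha) times.

    noise, noise_pred: arrays of identical shape (for example latent noise maps).
    pose_mask: boolean array broadcastable to that shape, True on the pose region.
    alpha: hypothetical emphasis factor; alpha = 0 recovers the plain MSE.
    """
    sq_err = (np.asarray(noise) - np.asarray(noise_pred)) ** 2
    weights = 1.0 + alpha * np.asarray(pose_mask, dtype=sq_err.dtype)
    return float((weights * sq_err).mean())
```

Under this form, an error of the same size costs more when it falls on the skeleton region than when it falls on the background, which is the behavior the list above attributes to the pose-mask-guided loss.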
### Experimental results

The paper reports experiments on multiple public datasets, including Human-Art, LAION-Human, UBC Fashion, Dance Track, and DAVIS. The results show that Stable-Pose improves pose accuracy, image quality, and text-image alignment, especially for complex poses and poses seen from different perspectives. On LAION-Human, for example, the abstract reports an AP score of 57.1, around 13% above ControlNet.

### Contributions

- **High-precision pose guidance**: Stable-Pose can accurately capture complex pose details while generating high-quality images, and maintains high precision even under challenging conditions.
- **Innovative attention mechanism**: By introducing the coarse-to-fine attention mask strategy, Stable-Pose handles the relationships between the different parts of the human pose more effectively.
- **Wide applicability**: Stable-Pose performs well across multiple datasets and application scenarios, demonstrating its applicability and robustness in pose-guided text-to-image generation tasks.

In conclusion, Stable-Pose significantly improves the performance of pose-guided text-to-image generation models by introducing a new attention mechanism and loss function, and addresses the shortcomings of existing methods in dealing with complex poses.

Stable-Pose: Leveraging Transformers for Pose-Guided Text-to-Image Generation

From External to Internal: Structuring Image for Text-to-Image Attributes Manipulation

STN-enhanced Message Passing Guided by Adversarial Learning for Human Pose Estimation

Follow Your Pose: Pose-Guided Text-to-Video Generation using Pose-Free Videos

From Text to Pose to Image: Improving Diffusion Model Control and Quality

HEViTPose: High-Efficiency Vision Transformer for Human Pose Estimation

Seeing the Pose in the Pixels: Learning Pose-Aware Representations in Vision Transformers

ViTPose: Simple Vision Transformer Baselines for Human Pose Estimation

Spatiotemporal Learning Transformer for Video-Based Human Pose Estimation

Poseur: Direct Human Pose Regression with Transformers

Shift Pose: A Lightweight Transformer-like Neural Network for Human Pose Estimation

GITPose: going shallow and deeper using vision transformers for human pose estimation

ViTPose++: Vision Transformer for Generic Body Pose Estimation

Bilateral Pose Transformer for Human Pose Estimation

TCAN: Animating Human Images with Temporally Consistent Pose Guidance using Diffusion Models

PoseEmbroider: Towards a 3D, Visual, Semantic-aware Human Pose Representation

Advancing Human Pose Estimation with Transformer Models: an Experimental Approach

Exploiting Temporal Contexts with Strided Transformer for 3D Human Pose Estimation

3D Human Pose Estimation with Spatial and Temporal Transformers

PoseCrafter: One-Shot Personalized Video Synthesis Following Flexible Pose Control

KPE: Keypoint Pose Encoding for Transformer-based Image Generation