HumanRefiner: Benchmarking Abnormal Human Generation and Refining with Coarse-to-fine Pose-Reversible Guidance

Guian Fang, Wenbiao Yan, Yuanfan Guo, Jianhua Han, Zutao Jiang, Hang Xu, Shengcai Liao, Xiaodan Liang
2024-07-09
Abstract: Text-to-image diffusion models have significantly advanced in conditional image generation. However, these models usually struggle with accurately rendering images featuring humans, resulting in distorted limbs and other anomalies. This issue primarily stems from the insufficient recognition and evaluation of limb qualities in diffusion models. To address this issue, we introduce AbHuman, the first large-scale synthesized human benchmark focusing on anatomical anomalies. This benchmark consists of 56K synthesized human images, each annotated with detailed, bounding-box level labels identifying 147K human anomalies in 18 different categories. Based on this, the recognition of human anomalies can be established, which in turn enhances image generation through traditional techniques such as negative prompting and guidance. To further boost the improvement, we propose HumanRefiner, a novel plug-and-play approach for the coarse-to-fine refinement of human anomalies in text-to-image generation. Specifically, HumanRefiner utilizes a self-diagnostic procedure to detect and correct issues related to both coarse-grained abnormal human poses and fine-grained anomaly levels, facilitating pose-reversible diffusion generation. Experimental results on the AbHuman benchmark demonstrate that HumanRefiner significantly reduces generative discrepancies, achieving a 2.9x improvement in limb quality compared to the state-of-the-art open-source generator SDXL and a 1.4x improvement over DALL-E 3 in human evaluations. Our data and code are available at https://github.com/Enderfga/HumanRefiner.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses the tendency of current text-to-image diffusion models to produce distorted limbs and other anomalies when generating images that contain humans. The models struggle in particular with the complex, flexible structure of the human body, producing extra limbs or malformed details such as irregular fingers. These problems mainly stem from the insufficient recognition and evaluation of limb quality by diffusion models. To address this, the authors introduce AbHuman, the first large-scale synthesized human benchmark focusing on anatomical anomalies, and propose HumanRefiner, a plug-and-play method that refines human anomalies in text-to-image generation through coarse-to-fine pose-reversible guidance.

### Specific problems

1. **Limb abnormalities**: Existing models often produce an incorrect number of limbs, twisted hands, or incorrectly positioned limbs when generating images containing humans.
2. **Insufficient control of detailed features**: Upstream training usually focuses on aligning overall features, such as whether a human has hands and legs, and pays less attention to detailed features such as the number of fingers. As a result, the model has limited control when generating detailed body parts.
3. **Lack of abnormal knowledge**: Existing models lack the knowledge to distinguish normal from abnormal human body features.

### Solutions

1. **AbHuman dataset**: A dataset of 56,000 synthesized human images, each annotated with detailed bounding-box-level labels that together mark 147,000 human anomaly instances. The labels fall into 18 categories, such as "abnormal/normal head" and "abnormal/normal hand".
2. **HumanRefiner method**: A plug-and-play method that detects and corrects both coarse-grained and fine-grained human anomalies through a coarse-to-fine self-diagnosis process. The specific steps include:
   - **Pose-guided generation**: Use a pose detector to generate an initial, globally refined image.
   - **Anomaly detector-guided repair**: Use an anomaly detector to identify and repair local fine-grained abnormalities.
   - **Anomaly scorer**: Provide a quantitative index of the severity of limb abnormalities in the generated image.

### Experimental results

The experiments show that HumanRefiner significantly reduces generative discrepancies on the AbHuman benchmark. It achieves a 2.9x improvement in limb quality over the state-of-the-art open-source generator SDXL and a 1.4x improvement over DALL-E 3 in human evaluations.

### Main contributions

1. **Introduction of the AbHuman dataset**: The first large-scale synthesized human benchmark focusing on anatomical anomalies.
2. **Proposal of the HumanRefiner method**: A plug-and-play method for refining human anomalies through coarse-to-fine pose-reversible guidance.
3. **Experimental verification**: Comprehensive experiments show that HumanRefiner significantly outperforms existing text-to-image models at generating high-quality human images.
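
To make the coarse-to-fine self-diagnosis process described above more concrete, the following is a minimal Python sketch of its control flow only. It is not the authors' implementation: `Detection`, `ToyImage`, `refine`, `anomaly_score`, and all `toy_*` functions are hypothetical names introduced here, the images are toy records rather than pixels, and the severity score (a sum of detector confidences over abnormal boxes) is an assumed stand-in for the paper's anomaly scorer. In a real system the callbacks would wrap a pose detector, a pose-conditioned diffusion model, an anomaly detector trained on AbHuman-style labels, and an inpainting step.

```python
from dataclasses import dataclass, replace
from typing import Callable, List, Tuple, TypeVar

Box = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max)
T = TypeVar("T")                 # whatever type represents an image


@dataclass(frozen=True)
class Detection:
    label: str    # e.g. "abnormal hand" or "normal head"
    box: Box
    score: float  # detector confidence in [0, 1]


def is_abnormal(det: Detection) -> bool:
    return det.label.startswith("abnormal")


def anomaly_score(detections: List[Detection]) -> float:
    # Assumed severity measure: total confidence of abnormal detections.
    return sum((d.score for d in detections if is_abnormal(d)), 0.0)


def refine(
    image: T,
    pose_ok: Callable[[T], bool],
    fix_pose: Callable[[T], T],
    detect: Callable[[T], List[Detection]],
    repair: Callable[[T, Box], T],
    max_rounds: int = 3,
) -> Tuple[T, float]:
    # Coarse stage: regenerate globally if the overall pose looks implausible.
    if not pose_ok(image):
        image = fix_pose(image)
    # Fine stage: repair each abnormal region, re-detecting after every round.
    for _ in range(max_rounds):
        abnormal = [d for d in detect(image) if is_abnormal(d)]
        if not abnormal:
            break
        for det in abnormal:
            image = repair(image, det.box)
    return image, anomaly_score(detect(image))


# Toy stand-ins so the sketch runs without any model.
@dataclass(frozen=True)
class ToyImage:
    pose_ok: bool
    bad_boxes: Tuple[Box, ...]


def toy_pose_ok(img: ToyImage) -> bool:
    return img.pose_ok


def toy_fix_pose(img: ToyImage) -> ToyImage:
    return replace(img, pose_ok=True)


def toy_detect(img: ToyImage) -> List[Detection]:
    dets = [Detection("normal head", (40, 0, 60, 20), 0.95)]
    dets += [Detection("abnormal hand", b, 0.9) for b in img.bad_boxes]
    return dets


def toy_repair(img: ToyImage, box: Box) -> ToyImage:
    return replace(img, bad_boxes=tuple(b for b in img.bad_boxes if b != box))


start = ToyImage(pose_ok=False, bad_boxes=((0, 50, 15, 65), (85, 50, 100, 65)))
before = anomaly_score(toy_detect(start))
refined, after = refine(start, toy_pose_ok, toy_fix_pose, toy_detect, toy_repair)
print(f"score before: {before:.2f}, after: {after:.2f}")
```

Passing the four stages as callbacks mirrors the plug-and-play framing: the loop does not depend on which generator or detector is used, and the score returned at the end gives a single number for comparing an image before and after refinement.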