Performance Comparison of ControlNet Models Based on PONY in Complex Human Pose Image Generation

Qinyu Zeng
DOI: https://doi.org/10.54254/2753-8818/52/2024ch0129
2024-09-27
Abstract: Over the past two years, text-to-image diffusion models have advanced considerably. The PONY model, in particular, excels at generating high-quality anime character images from open-domain text descriptions. However, such text descriptions often lack the granularity needed for detailed control, especially when generating complex human poses. To mitigate this limitation, recent research has introduced ControlNet to enhance the control capabilities of Stable Diffusion models. Nevertheless, a single ControlNet model remains suboptimal for generating complex poses, which highlights the potential of combining multiple ControlNet models. This paper introduces the Depth+OpenPose methodology, a multi-ControlNet approach that enables simultaneous local control through depth maps and pose maps, in addition to other global controls. Unlike single-model or other combined methods, Depth+OpenPose incorporates an additional conditional input. To address limb occlusion, depth maps provide positional relationships, while OpenPose captures facial expressions and hand poses; together they surpass the performance of single models. Furthermore, Depth+OpenPose demonstrates superior speed and quality relative to other combinations. It is important to note that combining too many models introduces too many conditional inputs, which reduces control efficacy. Comprehensive quantitative and qualitative experimental comparisons show that Depth+OpenPose outperforms existing methodologies in speed, image quality, and versatility.
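As an illustration of how two conditional inputs can act at once, the following minimal sketch assumes the common multi-ControlNet convention in which each ControlNet produces a residual and the residuals are added together, each weighted by a conditioning scale. The abstract does not state how Depth+OpenPose combines its conditions, so this combination rule, the function name `combine_control_residuals`, and the toy values are all assumptions made for illustration, not the paper's implementation.

```python
from typing import Dict, List


def combine_control_residuals(
    residuals: Dict[str, List[float]],
    scales: Dict[str, float],
) -> List[float]:
    """Sum per-condition residuals, each weighted by its conditioning scale.

    Hypothetical helper: real residuals are multi-dimensional feature maps;
    flat lists are used here only to keep the sketch self-contained.
    """
    lengths = {len(r) for r in residuals.values()}
    if len(lengths) != 1:
        raise ValueError("all residuals must have the same length")
    combined = [0.0] * lengths.pop()
    for name, residual in residuals.items():
        weight = scales.get(name, 1.0)  # assumed default scale of 1.0
        for i, value in enumerate(residual):
            combined[i] += weight * value
    return combined


# Toy residuals standing in for the depth and OpenPose branches (assumed values).
depth_residual = [0.5, -1.0, 2.0]
pose_residual = [1.0, 1.0, -2.0]

combined = combine_control_residuals(
    {"depth": depth_residual, "openpose": pose_residual},
    {"depth": 0.5, "openpose": 1.0},
)
print(combined)
```

Under this assumed rule, every additional ControlNet adds another term to the same sum, which is one way to picture the abstract's remark that too many conditional inputs can weaken control.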