Controllable Person Image Synthesis with Pose-Constrained Latent Diffusion

Xiao Han, Xiatian Zhu, Jiankang Deng, Yi-Zhe Song, Tao Xiang
DOI: https://doi.org/10.1109/iccv51070.2023.02081
2023-01-01
Abstract: Controllable person image synthesis aims at rendering a source image based on user-specified changes in body pose or appearance. Prior approaches leverage pixel-level denoising diffusion models conditioned on a coarse skeleton via cross-attention. This leads to two limitations: low efficiency and inaccurate condition information. To address both issues, a novel Pose-Constrained Latent Diffusion model (PoCoLD) is introduced. Rather than using the skeleton as a sparse pose representation, we exploit DensePose, which offers much richer body structure information. To effectively capitalize on DensePose at a low cost, we propose an efficient pose-constrained attention module that is capable of modeling the complex interplay between appearance and pose. Extensive experiments show that our PoCoLD outperforms state-of-the-art competitors in image synthesis fidelity. Critically, it runs 2× faster and consumes 3.6× less memory than the latest diffusion-model-based alternative during inference. Our code and models are available at https://github.com/BrandonHanx/PoCoLD.