Mingdeng Cao, Chong Mou, Ziyang Yuan, Xintao Wang, Zhaoyang Zhang, Ying Shan, Yinqiang Zheng
Abstract: Consistent human-centric image and video synthesis aims to generate images or videos with new poses while preserving appearance consistency with a given reference image, which is crucial for low-cost visual content creation. Recent advances based on diffusion models typically rely on separate networks for reference appearance feature extraction and target visual generation, leading to inconsistent domain gaps between references and targets. In this paper, we frame the task as a spatially-conditioned inpainting problem, where the target image is inpainted to maintain appearance consistency with the reference. This approach enables the reference features to guide the generation of pose-compliant targets within a unified denoising network, thereby mitigating domain gaps. Additionally, to better maintain the reference appearance information, we impose a causal feature interaction framework, in which reference features can only query from themselves, while target features can query appearance information from both the reference and the target. To further enhance computational efficiency and flexibility, in practical implementation, we decompose the spatially-conditioned generation process into two stages: reference appearance extraction and conditioned target generation. Both stages share a single denoising network, with interactions restricted to self-attention layers. This proposed method ensures flexible control over the appearance of generated human images and videos. By fine-tuning existing base diffusion models on human video data, our method demonstrates strong generalization to unseen human identities and poses without requiring additional per-instance fine-tuning. Experimental results validate the effectiveness of our approach, showing competitive performance compared to existing methods for consistent human image and video synthesis.
### What problem does this paper attempt to solve?
This paper addresses high-quality human image and video generation with consistent appearance. Given a reference image, the goal is to generate new images or videos that follow a specified pose while keeping the appearance consistent with the reference. This matters for low-cost visual content creation.
#### Main challenges:
1. **Maintaining appearance consistency**: The generated image or video must match the reference image, especially in fine details.
2. **Accurately aligning the target pose**: The generated content must precisely follow the given target pose.
#### Shortcomings of existing methods:
- **Traditional methods** (such as GAN-based methods) usually estimate the correspondence between the reference image and the target image, use a deformation module to warp the reference image to the target pose, and then produce the final result with a conditional GAN. These methods struggle to preserve small details, which leads to low resolution, distortion, detail loss, and appearance inconsistency.
- **Diffusion-model-based methods** generate realistic images and videos but still struggle to preserve fine-grained details. For example, a CLIP encoder embeds semantic information well but has difficulty capturing the discriminative representations needed to maintain appearance, and channel-wise concatenation tends to prioritize spatial layout over identity and appearance consistency.
#### Proposed solution:
This paper proposes a new method named **Spatially Conditioned Diffusion (SCD)**, and the main innovations include:
1. **Framing the task as a spatially-conditioned inpainting problem**: The target image is inpainted under the guidance of reference features inside a single, unified denoising network, so that it follows the target pose while the domain gap between reference and target features is reduced.
2. **Introducing a causal feature interaction mechanism**: Reference features can query only themselves, while target features can query both the reference and the target features. This helps preserve the fine-grained appearance details of the reference image.
3. **Splitting the spatially-conditioned generation process into two stages**: Reference appearance extraction and conditioned target generation share the same denoising network and interact only in the self-attention layers, which improves flexibility and computational efficiency.
4. **Fine-tuning an existing base diffusion model**: After fine-tuning on human video data, the method generalizes to unseen human identities and poses without additional per-instance fine-tuning.
With these changes, the experimental results show SCD to be effective and competitive with existing methods, generating high-quality human images and videos whose appearance stays consistent with the reference.
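The causal feature interaction and the two-stage split described above can be illustrated with a small toy example. The sketch below is not the authors' implementation: it assumes a single attention head, uses the token features directly as queries, keys, and values (no learned projections), and all function names (`attention`, `causal_mask`, `joint_pass`, `two_stage_pass`) are hypothetical. It only shows that masking reference rows so they attend to reference tokens alone gives the same result as first processing the reference by itself and then letting the target attend to the reference plus itself.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats (may contain -inf)."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values, mask=None):
    """Single-head scaled dot-product attention on lists of vectors.

    mask[i][j] is True when query i is allowed to attend to key j.
    """
    scale = 1.0 / math.sqrt(len(keys[0]))
    dim = len(values[0])
    out = []
    for i, q in enumerate(queries):
        scores = []
        for j, k in enumerate(keys):
            s = scale * sum(a * b for a, b in zip(q, k))
            if mask is not None and not mask[i][j]:
                s = float("-inf")  # blocked pair gets zero attention weight
            scores.append(s)
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, values))
                    for d in range(dim)])
    return out


def causal_mask(n_ref, n_tgt):
    """Reference rows see only reference columns; target rows see everything."""
    n = n_ref + n_tgt
    return [[(j < n_ref) or (i >= n_ref) for j in range(n)] for i in range(n)]


def joint_pass(ref, tgt):
    """One masked self-attention over the concatenated [reference, target] tokens."""
    tokens = ref + tgt
    return attention(tokens, tokens, tokens, causal_mask(len(ref), len(tgt)))


def two_stage_pass(ref, tgt):
    """Assumed two-stage form: reference first, then target with reference context."""
    # Stage 1: the reference attends only to itself, so it can be computed once.
    ref_out = attention(ref, ref, ref)
    # Stage 2: target queries attend to reference tokens plus target tokens.
    context = ref + tgt
    tgt_out = attention(tgt, context, context)
    return ref_out, tgt_out


if __name__ == "__main__":
    random.seed(0)
    ref = [[random.gauss(0, 1) for _ in range(4)] for _ in range(3)]
    tgt = [[random.gauss(0, 1) for _ in range(4)] for _ in range(2)]
    joint = joint_pass(ref, tgt)
    ref_out, tgt_out = two_stage_pass(ref, tgt)
    diff = max(abs(a - b)
               for row_j, row_s in zip(joint, ref_out + tgt_out)
               for a, b in zip(row_j, row_s))
    print("max difference between joint and two-stage outputs:", diff)
```

Because the reference output never depends on the target in this toy setting, it could be computed once and reused for several target poses, which is one plausible reading of the efficiency and flexibility benefit the summary mentions.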