Abstract: Current subject-driven image generation methods encounter significant challenges in person-centric image generation. The reason is that they learn semantic scene and person generation by fine-tuning a common pre-trained diffusion model, which involves an irreconcilable training imbalance. Precisely, to generate realistic persons, they need to sufficiently tune the pre-trained model, which inevitably causes the model to forget the rich semantic scene prior and makes scene generation over-fit to the training data. Moreover, even with sufficient fine-tuning, these methods still cannot generate high-fidelity persons, since joint learning of scene and person generation also leads to a quality compromise. In this paper, we propose Face-diffuser, an effective collaborative generation pipeline to eliminate the above training imbalance and quality compromise. Specifically, we first develop two specialized pre-trained diffusion models, i.e., Text-driven Diffusion Model (TDM) and Subject-augmented Diffusion Model (SDM), for scene and person generation, respectively. The sampling process is divided into three sequential stages, i.e., semantic scene construction, subject-scene fusion, and subject enhancement. The first and last stages are performed by TDM and SDM, respectively. The subject-scene fusion stage is where the two models collaborate, through a novel and highly effective mechanism, Saliency-adaptive Noise Fusion (SNF). Specifically, it is based on our key observation that there exists a robust link between classifier-free guidance responses and the saliency of generated images. In each time step, SNF leverages the unique strengths of each model and automatically blends the predicted noises from both models spatially in a saliency-aware manner. Extensive experiments confirm the impressive effectiveness and robustness of Face-diffuser.

What problem does this paper attempt to address?

This paper attempts to solve two main problems in the person-centric image generation task:

1. **Sub-optimal person generation**: When current methods jointly learn scene and person generation, the unbalanced training degrades the quality of the generated persons. Even after sufficient fine-tuning, these methods are still unable to generate high-fidelity persons.
2. **Catastrophic forgetting of the scene prior**: Generating realistic persons requires fine-tuning the pre-trained model thoroughly, which causes the model to forget its rich scene prior, so that scene generation over-fits to the training data.

The root cause of both problems is that existing methods learn scene and person generation simultaneously by fine-tuning a single general pre-trained diffusion model, and this joint learning leads to unbalanced training and a quality compromise. To solve these problems, the authors propose **Face-diffuser**, an effective collaborative generation pipeline with the following components:

- **Independent scene and person generation models**: The authors develop two specialized pre-trained diffusion models, the Text-driven Diffusion Model (TDM) and the Subject-augmented Diffusion Model (SDM), used for scene and person generation, respectively.
- **Three-stage sampling process**: The sampling process is divided into three sequential stages: semantic scene construction, subject-scene fusion, and subject enhancement. The first and last stages are performed by TDM and SDM, respectively.
- **Saliency-adaptive Noise Fusion (SNF) mechanism**: A novel collaboration mechanism for the subject-scene fusion stage. Based on the classifier-free guidance (CFG) responses of the two models, it automatically assigns generation responsibility between them at each time step, spatially blending their predicted noises.

Through these designs, Face-diffuser eliminates the training imbalance and quality compromise, and generates high-fidelity persons and diverse semantic scenes. Experimental results show that Face-diffuser generates high-quality images, especially when handling multiple unseen persons and varied backgrounds.
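
The text above describes SNF only at a high level, so the following is a minimal sketch under stated assumptions rather than the authors' implementation. It assumes that each model's CFG response is the absolute difference between its conditional and unconditional noise predictions, that each response map is normalised with a softmax over pixels to give a saliency map, and that each pixel takes its noise from whichever model is more salient there. All function names, the tiny 2x2 single-channel "noise maps", and the stage fractions in `stage_for_step` are hypothetical and chosen for illustration only.

```python
import math
from typing import List

Grid = List[List[float]]  # toy single-channel noise map (H x W); an assumption


def cfg_response(cond: Grid, uncond: Grid) -> Grid:
    """Per-pixel |conditional - unconditional| noise (assumed response form)."""
    return [[abs(c - u) for c, u in zip(rc, ru)] for rc, ru in zip(cond, uncond)]


def spatial_softmax(resp: Grid) -> Grid:
    """Normalise a response map so that its values sum to 1 over all pixels."""
    peak = max(max(row) for row in resp)
    exps = [[math.exp(v - peak) for v in row] for row in resp]
    total = sum(sum(row) for row in exps)
    return [[v / total for v in row] for row in exps]


def saliency_mask(sal_scene: Grid, sal_subject: Grid) -> List[List[int]]:
    """1 where the subject model is more salient than the scene model, else 0."""
    return [[1 if s > t else 0 for t, s in zip(rt, rs)]
            for rt, rs in zip(sal_scene, sal_subject)]


def fuse_noise(noise_scene: Grid, noise_subject: Grid,
               mask: List[List[int]]) -> Grid:
    """Take subject-model noise where mask is 1, scene-model noise elsewhere."""
    return [[s if m else t for t, s, m in zip(rt, rs, rm)]
            for rt, rs, rm in zip(noise_scene, noise_subject, mask)]


def stage_for_step(step: int, total: int,
                   scene_frac: float = 0.2, enhance_frac: float = 0.2) -> str:
    """Map a sampling step (0 = first) to one of the three stages.

    The fractions are hypothetical placeholders, not values from the paper.
    """
    if step < int(total * scene_frac):
        return "scene"      # scene model (TDM) alone
    if step >= total - int(total * enhance_frac):
        return "enhance"    # subject model (SDM) alone
    return "fusion"         # both models, blended by the saliency mask


# Toy predictions: the scene model reacts at the top-left pixel,
# the subject model at the bottom-right pixel.
zeros = [[0.0, 0.0], [0.0, 0.0]]
scene_resp = cfg_response([[0.9, 0.1], [0.1, 0.1]], zeros)
subject_resp = cfg_response([[0.1, 0.1], [0.1, 0.8]], zeros)

mask = saliency_mask(spatial_softmax(scene_resp), spatial_softmax(subject_resp))
fused = fuse_noise([[1.0, 1.0], [1.0, 1.0]], [[2.0, 2.0], [2.0, 2.0]], mask)
```

Because the mask is recomputed from the current responses at every fusion step, the split between the two models can change as the image forms, which is one reading of "automatically assigns generation responsibility" above. A real pipeline would operate on multi-channel latent tensors and would likely smooth the response maps before comparing them; both are left out here for brevity.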

High-fidelity Person-centric Subject-to-Image Synthesis

Multimodal-driven Talking Face Generation, Face Swapping, Diffusion Model

Novel 3D-Aware Composition Images Synthesis for Object Display with Diffusion Model

Collaborative Diffusion for Multi-Modal Face Generation and Editing

HumanDiffusion: a Coarse-to-Fine Alignment Diffusion Framework for Controllable Text-Driven Person Image Generation

MM-Diff: High-Fidelity Image Personalization via Multi-Modal Condition Integration

Scene Diffusion: Text-driven Scene Image Synthesis Conditioning on a Single 3D Model

Make-Your-3D: Fast and Consistent Subject-Driven 3D Content Generation

Coarse-to-Fine Latent Diffusion for Pose-Guided Person Image Synthesis

ComFusion: Personalized Subject Generation in Multiple Specific Scenes From Single Image

HiFi Tuner: High-Fidelity Subject-Driven Fine-Tuning for Diffusion Models

MagicFusion: Boosting Text-to-Image Generation Performance by Fusing Diffusion Models

Subject-Diffusion: Open Domain Personalized Text-to-Image Generation without Test-time Fine-tuning

BeyondScene: Higher-Resolution Human-Centric Scene Generation With Pretrained Diffusion

Dense-Face: Personalized Face Generation Model via Dense Annotation Prediction

Semantic Image Synthesis Via Diffusion Models

Consistent Human Image and Video Generation with Spatially Conditioned Diffusion

Advancing Pose-Guided Image Synthesis with Progressive Conditional Diffusion Models

Gen-3Diffusion: Realistic Image-to-3D Generation via 2D & 3D Diffusion Synergy

DiffuseHigh: Training-free Progressive High-Resolution Image Synthesis through Structure Guidance