Abstract: Current subject-driven image generation methods encounter significant challenges in person-centric image generation. The reason is that they learn semantic scene generation and person generation by fine-tuning a common pre-trained diffusion model, which involves an irreconcilable training imbalance. Specifically, to generate realistic persons, they need to tune the pre-trained model sufficiently, which inevitably causes the model to forget its rich semantic scene prior and makes scene generation over-fit to the training data. Moreover, even with sufficient fine-tuning, these methods still cannot generate high-fidelity persons, since joint learning of scene and person generation also leads to a quality compromise. In this paper, we propose Face-diffuser, an effective collaborative generation pipeline that eliminates the above training imbalance and quality compromise. Specifically, we first develop two specialized pre-trained diffusion models, i.e., the Text-driven Diffusion Model (TDM) and the Subject-augmented Diffusion Model (SDM), for scene and person generation, respectively. The sampling process is divided into three sequential stages, i.e., semantic scene construction, subject-scene fusion, and subject enhancement. The first and last stages are performed by TDM and SDM, respectively. In the subject-scene fusion stage, the two models collaborate through a novel and highly effective mechanism, Saliency-adaptive Noise Fusion (SNF). It is based on our key observation that there exists a robust link between classifier-free guidance responses and the saliency of generated images. At each time step, SNF leverages the unique strengths of each model and automatically blends the predicted noises from both models spatially, in a saliency-aware manner. Extensive experiments confirm the impressive effectiveness and robustness of Face-diffuser.
What problem does this paper attempt to address?
This paper attempts to solve two main problems in person-centric image generation:
1. **Sub-optimal person generation**: Current methods learn scene and person generation jointly, and the resulting training imbalance degrades the quality of the generated persons. Even after sufficient fine-tuning, these methods still cannot generate high-fidelity persons.
2. **Catastrophic forgetting of the scene prior**: Generating realistic persons requires extensive fine-tuning of the pre-trained model, which causes the model to forget its rich scene prior, so that scene generation over-fits to the training data.
The root cause of both problems is that existing methods learn scene and person generation simultaneously by fine-tuning a single general pre-trained diffusion model; this joint learning leads to the training imbalance and the quality compromise.
To solve these problems, the authors propose **Face-diffuser**, an effective collaborative generation pipeline with the following components:
- **Separate scene and person generation models**: The authors develop two specialized pre-trained diffusion models, the Text-driven Diffusion Model (TDM) and the Subject-augmented Diffusion Model (SDM), used for scene generation and person generation, respectively.
- **Three-stage sampling process**: The sampling process is divided into three sequential stages: semantic scene construction (performed by TDM), subject-scene fusion (performed by both models together), and subject enhancement (performed by SDM).
- **Saliency-adaptive Noise Fusion (SNF) mechanism**: This is a novel and effective collaboration mechanism. It uses the classifier-free guidance (CFG) responses of the two models as a saliency cue, automatically deciding at each time step which model is responsible for each image region and spatially blending their predicted noises accordingly.
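The three-stage schedule and the saliency-aware blending described above can be illustrated with a minimal sketch. It is not the authors' implementation. The source states only that SNF relies on CFG responses and blends the two models' noise predictions spatially, so the details here are assumptions made for illustration: the stage boundaries `fuse_start` and `fuse_end`, the use of the normalized absolute CFG response as a saliency map, the hard per-pixel mask, and all function names. Images are represented as flat lists of per-pixel values to keep the example dependency-free.

```python
from typing import List, Tuple

Vec = List[float]  # a flattened noise map, one value per pixel


def stage_for_step(step: int, fuse_start: int, fuse_end: int) -> str:
    """Pick the sampling stage for a step (boundaries are hypothetical)."""
    if step < fuse_start:
        return "scene_construction"   # TDM only
    if step < fuse_end:
        return "subject_scene_fusion"  # TDM and SDM blended by SNF
    return "subject_enhancement"      # SDM only


def cfg_response(eps_cond: Vec, eps_uncond: Vec) -> Vec:
    """CFG response: conditional minus unconditional noise prediction."""
    return [c - u for c, u in zip(eps_cond, eps_uncond)]


def cfg_noise(eps_cond: Vec, eps_uncond: Vec, scale: float) -> Vec:
    """Standard classifier-free guidance combination."""
    return [u + scale * (c - u) for c, u in zip(eps_cond, eps_uncond)]


def saliency(response: Vec) -> Vec:
    """Assumed saliency map: absolute response, normalized to sum to 1."""
    mags = [abs(r) for r in response]
    total = sum(mags)
    return [m / total for m in mags] if total > 0 else [0.0] * len(mags)


def snf_fuse(tdm_cond: Vec, tdm_uncond: Vec,
             sdm_cond: Vec, sdm_uncond: Vec,
             scale: float = 7.5) -> Tuple[Vec, List[int]]:
    """Blend the two guided noise predictions pixel by pixel.

    mask[i] == 1 means pixel i is taken from SDM (person),
    mask[i] == 0 means it is taken from TDM (scene). Ties go to TDM.
    """
    sal_t = saliency(cfg_response(tdm_cond, tdm_uncond))
    sal_s = saliency(cfg_response(sdm_cond, sdm_uncond))
    eps_t = cfg_noise(tdm_cond, tdm_uncond, scale)
    eps_s = cfg_noise(sdm_cond, sdm_uncond, scale)
    mask = [1 if s > t else 0 for s, t in zip(sal_s, sal_t)]
    fused = [es if m else et for m, es, et in zip(mask, eps_s, eps_t)]
    return fused, mask


# Toy 4-pixel example: TDM responds strongly at pixel 0, SDM at pixels 2 and 3.
fused, mask = snf_fuse(
    tdm_cond=[1.0, 0.0, 0.2, 0.0], tdm_uncond=[0.0, 0.0, 0.0, 0.0],
    sdm_cond=[0.0, 0.0, 0.9, 0.1], sdm_uncond=[0.0, 0.0, 0.0, 0.0],
    scale=2.0,
)
```

In this toy case the mask assigns pixels 2 and 3 to SDM and the remaining pixels to TDM, so each model contributes the regions where its own guidance response is relatively stronger. A hard mask is only one possible choice; a soft, weighted blend would fit the same description.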
Through these designs, Face-diffuser eliminates the training imbalance and the quality compromise, generating high-fidelity persons together with diverse semantic scenes. According to the experimental results, Face-diffuser produces high-quality images, especially when handling multiple unseen persons and varied backgrounds.