MoLE: Enhancing Human-centric Text-to-image Diffusion via Mixture of Low-rank Experts

Jie Zhu, Yixiong Chen, Mingyu Ding, Ping Luo, Leye Wang, Jingdong Wang
2024-10-31
Abstract: Text-to-image diffusion has attracted vast attention due to its impressive image-generation capabilities. However, when it comes to human-centric text-to-image generation, particularly in the context of faces and hands, the results often fall short of naturalness due to insufficient training priors. We alleviate the issue in this work from two perspectives. 1) From the data aspect, we carefully collect a human-centric dataset comprising over one million high-quality human-in-the-scene images and two specific sets of close-up images of faces and hands. These datasets collectively provide a rich prior knowledge base to enhance the human-centric image generation capabilities of the diffusion model. 2) On the methodological front, we propose a simple yet effective method called Mixture of Low-rank Experts (MoLE) by considering low-rank modules trained on close-up hand and face images respectively as experts. This concept draws inspiration from our observation of low-rank refinement, where a low-rank module trained by a customized close-up dataset has the potential to enhance the corresponding image part when applied at an appropriate scale. To validate the superiority of MoLE in the context of human-centric image generation compared to state-of-the-art, we construct two benchmarks and perform evaluations with diverse metrics and human studies. Datasets, model, and code are released at https://sites.google.com/view/mole4diffuser/.
Computer Vision and Pattern Recognition, Artificial Intelligence, Machine Learning
What problem does this paper attempt to address?
### Problems the Paper Aims to Solve

The paper addresses human-centric text-to-image generation, particularly the generation of faces and hands. Current models struggle to produce natural, realistic faces and hands, mainly because the training data lacks high-quality human-centric images and because these body parts are highly variable. The authors approach the problem from two aspects:

1. **Data Aspect**: A dataset containing over 1 million high-quality images of humans in scenes was collected, along with two additional subsets of close-up images of faces and hands. Together these datasets provide rich prior knowledge to enhance the human-centric image generation capability of diffusion models.
2. **Method Aspect**: A simple yet effective method called Mixture of Low-rank Experts (MoLE) is proposed. Low-rank modules trained on close-up images of faces and hands act as experts that refine the corresponding image parts. The method builds on the phenomenon of low-rank refinement, where a customized low-rank module can improve a specific part of the image when applied at an appropriate scale.

### Main Contributions

1. **High-Quality Dataset**: A dataset containing over 1 million high-quality human-centric images was collected, including two high-quality subsets of close-up images of faces and hands. Notably, the close-up hand dataset has not appeared in previous research.
2. **Mixture of Low-rank Experts Method**: The phenomenon of low-rank refinement was discovered, and the MoLE method was proposed, which trains low-rank modules as experts and flexibly activates them through soft assignment.
3. **Evaluation Benchmarks**: Two evaluation benchmarks for human-centric image generation were constructed, based on COCO Caption and DiffusionDB.

Experimental results show that MoLE outperforms existing methods and applies across multiple model architectures.

### Experimental Results

- **Performance Superiority**: On the COCO Human Prompts and DiffusionDB Human Prompts benchmarks, MoLE significantly outperforms VQ-Diffusion and Versatile Diffusion on the HPS and IR metrics, and significantly improves on the baseline model SD v1.5.
- **Generalization Ability**: Besides SD v1.5, MoLE was also validated on SDXL, SD v2.1, and PixArt-α, showing good generalization ability.
- **Ablation Study**: Experiments at each training stage verify that every stage improves generation performance. In particular, the soft assignment mechanism in the third stage effectively mitigates the negative impact of expert modules during generation, further enhancing the quality of face and hand generation.

In summary, the paper significantly improves the naturalness and realism of human-centric image generation through a high-quality dataset and the MoLE method.
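
To make the idea of softly assigned low-rank experts more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the class names, the dimensions, the random initialization, and in particular the per-expert sigmoid gate are all assumptions chosen for illustration, and the paper's actual gating and training procedure may differ. The sketch only shows the general shape of the computation: a frozen base projection plus low-rank updates, each scaled by a soft weight between 0 and 1.

```python
import math
import random


def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def rand_matrix(rows, cols, rng, std=0.1):
    """Small Gaussian-initialized matrix; the init scheme is an assumption."""
    return [[rng.gauss(0.0, std) for _ in range(cols)] for _ in range(rows)]


class LowRankExpert:
    """Hypothetical low-rank module: delta_W = up @ down, with rank << d_in, d_out."""

    def __init__(self, d_in, d_out, rank, rng):
        self.down = rand_matrix(rank, d_in, rng)  # rank x d_in
        self.up = rand_matrix(d_out, rank, rng)   # d_out x rank

    def __call__(self, x):
        return matvec(self.up, matvec(self.down, x))


class MixtureOfLowRankExperts:
    """Base projection plus softly weighted low-rank experts (illustrative only)."""

    def __init__(self, d_in, d_out, rank, num_experts=2, seed=0):
        rng = random.Random(seed)
        # Stands in for a frozen pretrained weight of the diffusion model.
        self.base = rand_matrix(d_out, d_in, rng)
        # For example, one expert for faces and one for hands.
        self.experts = [LowRankExpert(d_in, d_out, rank, rng) for _ in range(num_experts)]
        # One gate row per expert; a sigmoid gate is an assumption of this sketch.
        self.gate = rand_matrix(num_experts, d_in, rng)

    def gate_weights(self, x):
        """Soft assignment: an independent weight in (0, 1) for each expert."""
        return [1.0 / (1.0 + math.exp(-z)) for z in matvec(self.gate, x)]

    def __call__(self, x):
        out = matvec(self.base, x)
        for w, expert in zip(self.gate_weights(x), self.experts):
            out = [o + w * d for o, d in zip(out, expert(x))]
        return out


layer = MixtureOfLowRankExperts(d_in=8, d_out=4, rank=2)
x = [0.5] * 8
y = layer(x)
weights = layer.gate_weights(x)
```

Because each gate weight is computed independently in this sketch, the layer can scale down every expert at once when none is relevant to the input. That is one plausible way a soft assignment could limit the negative impact of experts mentioned in the ablation study, though the source does not specify the mechanism at this level of detail.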