EasyRef: Omni-Generalized Group Image Reference for Diffusion Models via Multimodal LLM

Zhuofan Zong, Dongzhi Jiang, Bingqi Ma, Guanglu Song, Hao Shao, Dazhong Shen, Yu Liu, Hongsheng Li
2024-12-13
Abstract: Significant achievements in personalization of diffusion models have been witnessed. Conventional tuning-free methods mostly encode multiple reference images by averaging their image embeddings as the injection condition, but such an image-independent operation cannot perform interaction among images to capture consistent visual elements within multiple references. Although the tuning-based Low-Rank Adaptation (LoRA) can effectively extract consistent elements within multiple images through the training process, it necessitates specific finetuning for each distinct image group. This paper introduces EasyRef, a novel plug-and-play adaptation method that enables diffusion models to be conditioned on multiple reference images and the text prompt. To effectively exploit consistent visual elements within multiple images, we leverage the multi-image comprehension and instruction-following capabilities of the multimodal large language model (MLLM), prompting it to capture consistent visual elements based on the instruction. Besides, injecting the MLLM's representations into the diffusion process through adapters can easily generalize to unseen domains, mining the consistent visual elements within unseen data. To mitigate computational costs and enhance fine-grained detail preservation, we introduce an efficient reference aggregation strategy and a progressive training scheme. Finally, we introduce MRBench, a new multi-reference image generation benchmark. Experimental results demonstrate EasyRef surpasses both tuning-free methods like IP-Adapter and tuning-based methods like LoRA, achieving superior aesthetic quality and robust zero-shot generalization across diverse domains.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper asks how a diffusion model can use multiple reference images together with a text prompt to generate high-quality images that stay consistent with the references. It focuses on three limitations of existing methods: they fail to capture the visual elements shared across reference images, they generalize poorly to unseen data, and they incur high computational costs.

### Main Problem Analysis

1. **Limitations of Multi-Reference Image Encoding**:
   - Conventional methods usually average the embedding vectors of the reference images and use the result as the injection condition. Because each image is encoded independently, this cannot capture interactions or consistent visual elements among the references.
   - Averaging can also cause spatial misalignment: when the target object appears at different positions in different reference images, the generated images may be inconsistent (as shown in Figure 2).
2. **Lack of Zero-Shot Generalization Ability**:
   - Existing fine-tuning-based methods (such as LoRA) can extract the consistent elements among multiple images, but they require separate fine-tuning for each image group and therefore lack zero-shot generalization ability.
3. **High Computational Costs**:
   - Computational cost grows significantly with the number of reference images, which becomes a problem when many references are used.

### Solutions

To solve the above problems, the paper proposes EasyRef, a novel plug-and-play adaptation method that enables diffusion models to be conditioned on multiple reference images and text prompts simultaneously. Specific solutions include:

1. **Utilizing Multimodal Large Language Models (MLLM)**:
   - EasyRef uses the multi-image understanding and instruction-following abilities of an MLLM, prompting it with an instruction to capture the consistent visual elements among the reference images.
   - The MLLM's representations are injected into the diffusion process through adapters, which allows generalization to unseen domains.
2. **Efficient Reference Aggregation Strategy**:
   - The reference representations are encapsulated into learnable reference tokens to reduce computational costs.
3. **Progressive Training Scheme**:
   - A progressive training scheme gradually enhances the MLLM's ability to capture fine-grained visual details.
4. **Introducing MRBench Benchmark**:
   - A new multi-reference image generation benchmark (MRBench) is constructed for evaluating multi-reference image generation tasks.

According to the paper's experiments, EasyRef surpasses both tuning-free methods such as IP-Adapter and tuning-based methods such as LoRA, achieving superior aesthetic quality and robust zero-shot generalization across diverse domains.

### Summary

The main contributions of the paper are:

- Proposing EasyRef, a diffusion model adaptation technique that can be jointly conditioned on multiple reference images and text prompts.
- Designing an efficient reference aggregation strategy and a progressive training scheme, reducing computational costs and enhancing fine-grained perception ability.
- Introducing MRBench, providing a new evaluation benchmark for multi-reference image generation tasks.

Together, these contributions advance multi-reference image generation.
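
The contrast between embedding averaging and aggregation into a fixed set of reference tokens can be illustrated with a small sketch. This is not the paper's implementation: EasyRef performs the aggregation inside an MLLM with learnable tokens, whereas the code below substitutes a single dot-product attention step in plain Python. All function names, the hand-picked vectors, and the fixed (untrained) reference token are assumptions made for illustration only.

```python
import math


def mean_pool(embeddings):
    """Element-wise average of a list of equal-length vectors (the averaging baseline)."""
    n = len(embeddings)
    dim = len(embeddings[0])
    return [sum(e[d] for e in embeddings) / n for d in range(dim)]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [x / total for x in exps]


def attend(query, tokens):
    """Single-head scaled dot-product attention of one query over a list of tokens."""
    scale = math.sqrt(len(query))
    scores = [sum(q * t for q, t in zip(query, tok)) / scale for tok in tokens]
    weights = softmax(scores)
    dim = len(tokens[0])
    return [sum(w * tok[d] for w, tok in zip(weights, tokens)) for d in range(dim)]


def aggregate_references(reference_tokens, image_tokens_per_image):
    """Hypothetical stand-in for reference aggregation.

    Each reference token attends over the tokens of ALL images jointly, so the
    output has len(reference_tokens) vectors no matter how many images are given.
    In a trained model the reference tokens would be learned; here they are fixed.
    """
    all_tokens = [tok for image in image_tokens_per_image for tok in image]
    return [attend(q, all_tokens) for q in reference_tokens]


if __name__ == "__main__":
    # Two toy "images", each with two 2-D patch tokens. Both contain [1, 0]
    # (the shared element) plus one image-specific token.
    images = [
        [[1.0, 0.0], [0.0, 1.0]],
        [[1.0, 0.0], [0.0, -1.0]],
    ]
    # Baseline: encode each image independently (mean of its patches), then average.
    per_image = [mean_pool(image) for image in images]
    print("averaged embedding:", mean_pool(per_image))

    # Sketch: one reference token that happens to align with the shared element.
    print("aggregated tokens:", aggregate_references([[4.0, 0.0]], images))
```

In this toy setup the averaged embedding dilutes the shared element with the image-specific ones, while the attention-based aggregate is dominated by the tokens that match the query. The number of output vectors also stays fixed as more reference images are added, which is one plausible reading of how a fixed set of reference tokens keeps the conditioning cost bounded.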