Rare-to-Frequent: Unlocking Compositional Generation Power of Diffusion Models on Rare Concepts with LLM Guidance

Dongmin Park, Sebin Kim, Taehong Moon, Minkyu Kim, Kangwook Lee, Jaewoong Cho
2024-10-29
Abstract: State-of-the-art text-to-image (T2I) diffusion models often struggle to generate rare compositions of concepts, e.g., objects with unusual attributes. In this paper, we show that the compositional generation power of diffusion models on such rare concepts can be significantly enhanced by Large Language Model (LLM) guidance. We start with empirical and theoretical analysis, demonstrating that exposing frequent concepts relevant to the target rare concepts during the diffusion sampling process yields more accurate concept composition. Based on this, we propose a training-free approach, R2F, that plans and executes the overall rare-to-frequent concept guidance throughout the diffusion inference by leveraging the abundant semantic knowledge in LLMs. Our framework is flexible across any pre-trained diffusion models and LLMs, and can be seamlessly integrated with region-guided diffusion approaches. In extensive experiments on three datasets, including our newly proposed benchmark, RareBench, which contains various prompts with rare compositions of concepts, R2F significantly surpasses existing models including SD3.0 and FLUX by up to 28.1%p in T2I alignment. Code is available at https://github.com/krafton-ai/Rare2Frequent.
Machine Learning, Artificial Intelligence, Computation and Language, Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
### Problems Addressed by the Paper

This paper addresses the difficulty that current state-of-the-art text-to-image (T2I) diffusion models have with rare concept combinations. Specifically, these models perform poorly on objects with unusual attributes, such as "a squirrel with a beard holding a guitar" or "a beautiful octopus with a wig juggling 3 star-shaped apples." Existing pre-trained and large language model (LLM)-based T2I diffusion models (such as SD3.0, FLUX, and RPG) excel at generating common concepts but struggle with rare ones.

### Solution

To address this issue, the authors propose a training-free method called **R2F (Rare-to-Frequent)**, which leverages the rich semantic knowledge of LLMs to guide the diffusion model's inference process. The main contributions of R2F include:

1. **Theoretical Analysis**: Through empirical and theoretical analysis, the authors find that exposing common concepts related to the target rare concept during the diffusion sampling process improves the accuracy of concept composition.
2. **Framework Design**: R2F works in two stages:
   - **Rare-to-Frequent Concept Mapping**: An LLM decomposes the input text into sub-prompts, identifies the rare concepts within them, and finds common concepts related to those rare concepts.
   - **Alternating Concept Guidance**: The rare and common concept prompts are used in alternation during diffusion inference to generate more accurate images.
3. **Extension to Region-Guided Diffusion Models**: R2F+ extends R2F to more complex multi-object images through region control techniques such as cross-attention control and latent fusion.

### Experimental Validation

To validate the effectiveness of R2F, the authors propose a new benchmark dataset, **RareBench**, which contains prompts with various rare concept combinations. Experimental results show that R2F significantly outperforms existing models such as SD3.0 and FLUX in generating rare concept images, improving T2I alignment by up to 28.1 percentage points (28.1%p).

### Conclusion

By leveraging the knowledge of LLMs, R2F effectively addresses the shortcomings of existing T2I diffusion models in generating rare concept combinations, providing a new approach to generating high-quality images of rare concepts.
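
To make the Alternating Concept Guidance stage described under Solution more concrete, here is a minimal Python sketch of a per-step prompt schedule. It is an illustration under stated assumptions, not the authors' implementation: the names `ConceptMapping`, `alternating_schedule`, and `stop_step` are hypothetical, the example prompts are invented, and both the choice to start with the frequent prompt and the idea of ending the alternation at a fixed step are assumptions that the summary above does not specify. In a real pipeline, the prompt selected for each step would be encoded and passed to the denoiser at that step, which is not shown here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ConceptMapping:
    """Hypothetical container for the output of the concept-mapping stage."""
    rare_prompt: str      # prompt containing the rare concept
    frequent_prompt: str  # same prompt with a related frequent concept


def alternating_schedule(
    mapping: ConceptMapping, num_steps: int, stop_step: int
) -> List[str]:
    """Return the prompt to condition on at each denoising step.

    Assumption: before `stop_step`, even-numbered steps use the frequent
    prompt and odd-numbered steps use the rare prompt; from `stop_step`
    onward only the rare (target) prompt is used.
    """
    if not 0 <= stop_step <= num_steps:
        raise ValueError("stop_step must lie between 0 and num_steps")

    schedule = []
    for step in range(num_steps):
        if step < stop_step and step % 2 == 0:
            schedule.append(mapping.frequent_prompt)
        else:
            schedule.append(mapping.rare_prompt)
    return schedule


if __name__ == "__main__":
    # Invented example pair, for illustration only.
    mapping = ConceptMapping(
        rare_prompt="a squirrel with a beard",
        frequent_prompt="a man with a beard",
    )
    for step, prompt in enumerate(alternating_schedule(mapping, 6, 4)):
        print(step, prompt)
```

Keeping the schedule as a plain list of prompts separates the planning step (which an LLM would drive) from the sampling loop, so the same schedule could in principle be consumed by any diffusion backbone.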