Abstract: State-of-the-art text-to-image (T2I) diffusion models often struggle to generate rare compositions of concepts, e.g., objects with unusual attributes. In this paper, we show that the compositional generation power of diffusion models on such rare concepts can be significantly enhanced by Large Language Model (LLM) guidance. We start with empirical and theoretical analysis, demonstrating that exposing frequent concepts relevant to the target rare concepts during the diffusion sampling process yields more accurate concept composition. Based on this, we propose a training-free approach, R2F, that plans and executes the overall rare-to-frequent concept guidance throughout the diffusion inference by leveraging the abundant semantic knowledge in LLMs. Our framework is flexible across any pre-trained diffusion models and LLMs, and can be seamlessly integrated with region-guided diffusion approaches. Extensive experiments on three datasets, including our newly proposed benchmark, RareBench, which contains various prompts with rare compositions of concepts, show that R2F significantly surpasses existing models including SD3.0 and FLUX by up to 28.1%p in T2I alignment. Code is available at https://github.com/krafton-ai/Rare2Frequent.

What problem does this paper attempt to address?

### Problems Addressed by the Paper

This paper addresses the difficulty that current state-of-the-art text-to-image (T2I) diffusion models have with rare concept combinations. Specifically, these models perform poorly on objects with unusual attributes, such as "a squirrel with a beard holding a guitar" or "a beautiful octopus with a wig juggling 3 star-shaped apples." Existing pre-trained and large language model (LLM)-based T2I diffusion models (such as SD3.0, FLUX, and RPG) excel at generating common concepts but struggle with rare ones.

### Solution

To address this issue, the authors propose a training-free method called **R2F (Rare-to-Frequent)**, which leverages the rich semantic knowledge of LLMs to guide the diffusion model's inference process. The main contributions of R2F include:

1. **Theoretical Analysis**: Through empirical and theoretical analysis, the authors find that exposing frequent concepts related to the target rare concept during the diffusion sampling process improves the accuracy of concept composition.
2. **Framework Design**: R2F works in two stages:
   - **Rare-to-Frequent Concept Mapping**: An LLM decomposes the input text into sub-prompts, identifies the rare concepts within them, and finds frequent concepts related to those rare concepts.
   - **Alternating Concept Guidance**: The rare and frequent concept prompts are used alternately during diffusion inference to generate more accurate images.
3. **Extension to Region-Guided Diffusion Models**: R2F+ extends R2F to more complex multi-object images through region control techniques such as cross-attention control and latent fusion.

### Experimental Validation

To validate the effectiveness of R2F, the authors propose a new benchmark dataset, **RareBench**, which contains prompts with various rare concept combinations. Experimental results show that R2F significantly outperforms existing models such as SD3.0 and FLUX on rare concept images, by up to 28.1 percentage points (28.1%p) in T2I alignment.

### Conclusion

By leveraging the knowledge of LLMs, R2F addresses the shortcomings of existing T2I diffusion models in generating rare concept combinations, providing a training-free way to generate images of rare concepts that match the prompt more accurately.
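The alternating concept guidance described in the summary can be illustrated with a minimal Python sketch. It is not the authors' implementation. The `ConceptMapping` structure, the `build_prompt_schedule` function, the even/odd alternation pattern, and the `stop_step` parameter are all assumptions made for illustration, and the "squirrel" to "man" mapping is a hand-written hypothetical stand-in for what an LLM might propose. The sketch only builds the per-step prompt schedule; feeding each prompt to an actual diffusion sampler is left out.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ConceptMapping:
    """A rare concept and a related frequent concept (hypothetical structure)."""
    rare: str
    frequent: str


def build_prompt_schedule(prompt: str, mapping: ConceptMapping,
                          num_steps: int, stop_step: int) -> List[str]:
    """Return the text prompt used to condition each denoising step.

    Assumed schedule (for illustration only): before `stop_step`, even steps
    use the frequent-concept prompt and odd steps use the original prompt;
    from `stop_step` onward only the original prompt is used.
    """
    if not 0 <= stop_step <= num_steps:
        raise ValueError("stop_step must lie between 0 and num_steps")
    if mapping.rare not in prompt:
        raise ValueError("the rare concept must appear in the prompt")

    # Swap the rare concept for its frequent counterpart.
    frequent_prompt = prompt.replace(mapping.rare, mapping.frequent)

    schedule = []
    for step in range(num_steps):
        if step < stop_step and step % 2 == 0:
            schedule.append(frequent_prompt)
        else:
            schedule.append(prompt)
    return schedule


# Usage: the mapping below is a hypothetical example, not LLM output.
rare_prompt = "a squirrel with a beard holding a guitar"
mapping = ConceptMapping(rare="squirrel", frequent="man")
for step, text in enumerate(build_prompt_schedule(rare_prompt, mapping, 6, 4)):
    print(step, text)
```

In this sketch the early steps see the frequent concept, which matches the summary's claim that exposing frequent concepts during sampling helps composition, while the later steps use only the original prompt so that the final image still reflects the rare concept. How long to alternate is a design choice that the summary does not specify.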

Rare-to-Frequent: Unlocking Compositional Generation Power of Diffusion Models on Rare Concepts with LLM Guidance
