Abstract: Recent research <a class="link-https" data-arxiv-id="2410.15027" href="https://arxiv.org/abs/2410.15027">arXiv:2410.15027</a> has explored the use of diffusion transformers (DiTs) for task-agnostic image generation by simply concatenating attention tokens across images. However, despite substantial computational resources, the fidelity of the generated images remains suboptimal. In this study, we reevaluate and streamline this framework by hypothesizing that text-to-image DiTs inherently possess in-context generation capabilities, requiring only minimal tuning to activate them. Through diverse task experiments, we qualitatively demonstrate that existing text-to-image DiTs can effectively perform in-context generation without any tuning. Building on this insight, we propose a remarkably simple pipeline to leverage the in-context abilities of DiTs: (1) concatenate images instead of tokens, (2) perform joint captioning of multiple images, and (3) apply task-specific LoRA tuning using small datasets (e.g., $20\sim 100$ samples) instead of full-parameter tuning with large datasets. We name our models In-Context LoRA (IC-LoRA). This approach requires no modifications to the original DiT models, only changes to the training data. Remarkably, our pipeline generates high-fidelity image sets that better adhere to prompts. While task-specific in terms of tuning data, our framework remains task-agnostic in architecture and pipeline, offering a powerful tool for the community and providing valuable insights for further research on product-level task-agnostic generation systems. We release our code, data, and models at <a class="link-external link-https" href="https://github.com/ali-vilab/In-Context-LoRA" rel="external noopener nofollow">this https URL</a>.
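
To make step (3) of the abstract concrete, here is a minimal NumPy sketch of the standard low-rank adaptation idea: the frozen weight `W` is left untouched and only two small matrices `A` and `B` are trained. The layer size, rank, and `alpha` value are illustrative assumptions and are not taken from the paper, and the helper names `make_lora` and `lora_delta` are hypothetical.

```python
import numpy as np


def make_lora(d_out, d_in, rank, seed=0):
    """Create LoRA factors: A is small random, B starts at zero (assumed init)."""
    rng = np.random.default_rng(seed)
    A = rng.normal(0.0, 0.01, size=(rank, d_in))
    B = np.zeros((d_out, rank))
    return A, B


def lora_delta(A, B, alpha):
    """Low-rank weight update (alpha / r) * B @ A, where r is the rank."""
    rank = A.shape[0]
    return (alpha / rank) * (B @ A)


# Illustrative sizes only; a real DiT layer is far larger.
d_out, d_in, rank = 64, 64, 4
W = np.random.default_rng(1).normal(size=(d_out, d_in))  # frozen base weight
A, B = make_lora(d_out, d_in, rank)

# The adapted weight used at inference; W itself is never modified.
W_adapted = W + lora_delta(A, B, alpha=8.0)

full_params = W.size           # parameters touched by full fine-tuning
lora_params = A.size + B.size  # parameters trained with LoRA
print(f"full: {full_params}, LoRA: {lora_params}")
```

Because `B` starts at zero, the adapted layer initially behaves exactly like the base model, and the number of trainable parameters grows with the rank rather than with the full weight size, which is one reason a small dataset can be enough for this kind of tuning.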

What problem does this paper attempt to address?

This paper addresses how to adapt text-to-image generation models, specifically diffusion transformers (DiTs), to a variety of generation tasks, especially those that require generating sets of images with complex intrinsic relationships. Existing methods such as Group Diffusion Transformers (GDT) can generate image sets in an unsupervised manner, but their generation quality is often inferior to that of pre-trained text-to-image models. The paper therefore proposes In-Context LoRA (IC-LoRA), which aims to activate the in-context generation capabilities of existing text-to-image models with minimal tuning and a small amount of data, thereby improving the quality and consistency of the generated images.

The paper's main contributions rest on three assumptions:

1. **Existing models already have in-context generation capabilities**: The authors hypothesize that existing text-to-image models already possess in-context generation capabilities, which can be applied to complex generation tasks once they are appropriately triggered and enhanced.
2. **No need to modify the model architecture**: By changing the input data rather than the model architecture, existing text-to-image models can be reused for in-context generation.
3. **Efficient use of small amounts of data and computational resources**: High-quality results can be achieved with a small amount of high-quality data and minimal computational resources.

To validate these assumptions, the authors design a simple pipeline with the following steps:

1. **Image concatenation**: Concatenate multiple images into one large image instead of concatenating attention tokens.
2. **Prompt concatenation**: Merge the prompts of the individual images into one long prompt, so that the model can process and generate multiple images simultaneously.
3. **Low-rank adaptation (LoRA) fine-tuning on small datasets**: Fine-tune the model on a small set of high-quality images to trigger and enhance its in-context generation capabilities.

With this pipeline, the authors demonstrate that their models can generate high-quality image sets across a range of tasks, including storyboard generation, font design, portrait photography, visual identity design, and home decoration. The method also supports generation conditioned on a reference image, which further broadens its flexibility and applicability.
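
The first two pipeline steps can be sketched as a small data-preparation routine. This is a hedged illustration under stated assumptions, not the authors' code: images are assumed to be NumPy arrays of equal height, and the helper names `concat_images` and `joint_caption`, as well as the `[IMAGE1]`-style markers, are hypothetical choices made for this example.

```python
import numpy as np


def concat_images(images):
    """Concatenate H x W x C arrays side by side into one composite image."""
    heights = {img.shape[0] for img in images}
    if len(heights) != 1:
        raise ValueError("all images must share the same height")
    # axis=1 is the width axis, so the panels end up next to each other.
    return np.concatenate(images, axis=1)


def joint_caption(overall, panel_captions):
    """Merge an overall description and per-image captions into one prompt."""
    parts = [overall.strip()]
    for index, caption in enumerate(panel_captions, start=1):
        # Hypothetical marker format tying each caption to its panel.
        parts.append(f"[IMAGE{index}] {caption.strip()}")
    return " ".join(parts)


# Two dummy "images" standing in for real training samples.
left = np.zeros((4, 3, 3), dtype=np.uint8)
right = np.ones((4, 5, 3), dtype=np.uint8)

composite = concat_images([left, right])
prompt = joint_caption("A two-panel set.", ["a cat", "a dog"])
print(composite.shape, prompt)
```

Each composite image and its merged prompt then form a single ordinary text-to-image training pair, which is why this setup needs no change to the model itself.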

In-Context LoRA for Diffusion Transformers

LoRA Fusion: Enhancing Image Generation

Video Diffusion Transformers are In-Context Learners

LoRA-Composer: Leveraging Low-Rank Adaptation for Multi-Concept Customization in Training-Free Diffusion Models

Multi-LoRA Composition for Image Generation

DiffLoRA: Generating Personalized Low-Rank Adaptation Weights with Diffusion

LaDiC: Are Diffusion Models Really Inferior to Autoregressive Counterparts for Image-to-Text Generation?

CLoRA: A Contrastive Approach to Compose Multiple LoRA Models

Block-wise LoRA: Revisiting Fine-grained LoRA for Effective Personalization and Stylization in Text-to-Image Generation

CtrLoRA: An Extensible and Efficient Framework for Controllable Image Generation

LoRA Diffusion: Zero-Shot LoRA Synthesis for Diffusion Model Personalization

ChatDiT: A Training-Free Baseline for Task-Agnostic Free-Form Chatting with Diffusion Transformers

Paragraph-to-Image Generation with Information-Enriched Diffusion Model

LCM-LoRA: A Universal Stable-Diffusion Acceleration Module

IntLoRA: Integral Low-rank Adaptation of Quantized Diffusion Models

Group Diffusion Transformers are Unsupervised Multitask Learners

Latent Space Disentanglement in Diffusion Transformers Enables Zero-shot Fine-grained Semantic Editing

TerDiT: Ternary Diffusion Models with Transformers

A LoRA is Worth a Thousand Pictures

TextDiffuser: Diffusion Models as Text Painters

UDiffText: A Unified Framework for High-quality Text Synthesis in Arbitrary Images via Character-aware Diffusion Models