In-Context LoRA for Diffusion Transformers

Lianghua Huang, Wei Wang, Zhi-Fan Wu, Yupeng Shi, Huanzhang Dou, Chen Liang, Yutong Feng, Yu Liu, Jingren Zhou
2024-11-01
Abstract: Recent research (arXiv:2410.15027, https://arxiv.org/abs/2410.15027) has explored the use of diffusion transformers (DiTs) for task-agnostic image generation by simply concatenating attention tokens across images. However, despite substantial computational resources, the fidelity of the generated images remains suboptimal. In this study, we reevaluate and streamline this framework by hypothesizing that text-to-image DiTs inherently possess in-context generation capabilities, requiring only minimal tuning to activate them. Through diverse task experiments, we qualitatively demonstrate that existing text-to-image DiTs can effectively perform in-context generation without any tuning. Building on this insight, we propose a remarkably simple pipeline to leverage the in-context abilities of DiTs: (1) concatenate images instead of tokens, (2) perform joint captioning of multiple images, and (3) apply task-specific LoRA tuning using small datasets (e.g., $20\sim 100$ samples) instead of full-parameter tuning with large datasets. We name our models In-Context LoRA (IC-LoRA). This approach requires no modifications to the original DiT models, only changes to the training data. Remarkably, our pipeline generates high-fidelity image sets that better adhere to prompts. While task-specific in terms of tuning data, our framework remains task-agnostic in architecture and pipeline, offering a powerful tool for the community and providing valuable insights for further research on product-level task-agnostic generation systems. We release our code, data, and models at https://github.com/ali-vilab/In-Context-LoRA
Computer Vision and Pattern Recognition, Graphics
What problem does this paper attempt to address?
This paper addresses how to adapt text-to-image generation models (such as diffusion transformers) to a wide range of generation tasks, especially those that require image sets with complex intrinsic relationships. Existing methods such as Group Diffusion Transformers (GDT) can generate image sets in an unsupervised manner, but their generation quality is often inferior to that of pre-trained text-to-image models. The paper therefore proposes In-Context LoRA (IC-LoRA), which aims to activate the in-context generation capabilities of existing text-to-image models with minimal tuning and a small amount of data, thereby improving the quality and consistency of the generated images.

Specifically, the main contributions of the paper include:

1. **Hypothesizing that existing models have in-context generation capabilities**: The authors argue that existing text-to-image models already possess in-context generation capabilities, which can be applied to complex generation tasks once they are appropriately triggered and enhanced.
2. **No need to modify the model architecture**: By changing the input data instead of the model architecture, existing text-to-image models can be reused for in-context generation.
3. **Efficient use of small amounts of data and computational resources**: High-quality results can be achieved with a small amount of high-quality data and minimal computational resources.

To validate these hypotheses, the authors designed a simple and effective pipeline with the following steps:

1. **Image concatenation**: Concatenate multiple images into one large image instead of concatenating attention tokens.
2. **Prompt merging**: Merge the prompts of the individual images into one long prompt, so that the model processes and generates all images jointly.
3. **Low-rank adaptation (LoRA) fine-tuning on small datasets**: Fine-tune the model on a small set of high-quality images (e.g., 20 to 100 samples) to trigger and enhance its in-context generation capabilities.

With this pipeline, the authors demonstrate that the model can generate high-quality image sets across various tasks, including storyboard generation, font design, portrait photography, visual identity design, home decoration, and more. The method also supports generation conditioned on reference images, which further extends its flexibility and applicability.
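
The data-preparation side of the pipeline (steps 1 and 2) can be illustrated with a minimal sketch. It is not the authors' code: the function names `concat_images` and `merge_prompts`, the `[IMAGE-i]` markers, and the use of nested lists as stand-ins for real image arrays are all assumptions made for illustration.

```python
from typing import List

Pixel = int                 # grayscale value, kept simple for illustration
Image = List[List[Pixel]]   # an image as rows of pixels (stand-in for a real array)


def concat_images(images: List[Image]) -> Image:
    """Place equally tall images side by side to form one composite image."""
    if not images:
        raise ValueError("need at least one image")
    height = len(images[0])
    if any(len(img) != height for img in images):
        raise ValueError("all images must have the same height")
    # Join row r of every image into row r of the composite.
    return [[px for img in images for px in img[r]] for r in range(height)]


def merge_prompts(overview: str, captions: List[str]) -> str:
    """Merge per-image captions into one long prompt.

    The [IMAGE-i] markers are a hypothetical convention for this sketch.
    """
    parts = [f"[IMAGE-{i}] {c.strip()}" for i, c in enumerate(captions, start=1)]
    return " ".join([overview.strip()] + parts)


# Three tiny 2x2 "images" filled with constant values.
panels = [[[v, v], [v, v]] for v in (0, 128, 255)]
composite = concat_images(panels)
prompt = merge_prompts(
    "A three-panel storyboard of a hiker.",
    ["She packs a bag.", "She climbs a ridge.", "She rests at the summit."],
)
print(len(composite), len(composite[0]))  # height and width of the composite
print(prompt)
```

Each (composite image, merged prompt) pair then serves as one ordinary text-to-image training sample, which is why the base model itself needs no architectural change.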
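
For readers unfamiliar with LoRA, the low-rank update behind step 3 can be written in its standard form. The symbols $d$, $k$, $r$, and $\alpha$ are generic, and the numbers in the worked count below are illustrative assumptions, not values reported in the paper.

$$
W' = W + \frac{\alpha}{r} B A, \qquad W \in \mathbb{R}^{d \times k},\quad B \in \mathbb{R}^{d \times r},\quad A \in \mathbb{R}^{r \times k},\quad r \ll \min(d, k)
$$

The pre-trained weight $W$ stays frozen and only $A$ and $B$ are trained, so each adapted layer has $r(d + k)$ trainable parameters instead of $dk$. For a hypothetical layer with $d = k = 1024$ and $r = 16$:

$$
r(d + k) = 16 \times 2048 = 32{,}768 \qquad \text{versus} \qquad dk = 1024 \times 1024 = 1{,}048{,}576,
$$

which is about 3.1% of the full count. This small trainable footprint is consistent with tuning on only a few dozen samples being feasible.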