Abstract: Large pretrained diffusion models have demonstrated impressive generation capabilities and have been adapted to various downstream tasks. However, unlike Large Language Models (LLMs), which can learn multiple tasks in a single model from instruction data, diffusion models typically require additional branches, task-specific training strategies, and losses to adapt effectively to different downstream tasks. This task-specific fine-tuning approach has two drawbacks. 1) The task-specific additional networks create gaps between pretraining and fine-tuning, which hinders the transfer of pretrained knowledge. 2) It necessitates careful design of the additional networks, raising the barrier to learning and implementation and making it less user-friendly. Thus, a question arises: Can we achieve a simple, efficient, and general approach to fine-tuning diffusion models? To this end, we propose ONE-PIC. It enhances the inherited generative ability of pretrained diffusion models without introducing additional modules. Specifically, we propose In-Visual-Context Tuning, which constructs task-specific training data by arranging source images and target images into a single image. This approach brings downstream fine-tuning closer to pretraining, allowing our model to adapt more quickly to various downstream tasks. Moreover, we propose a Masking Strategy to unify different generative tasks. This strategy transforms various downstream fine-tuning tasks into predictions of the masked portions. Extensive experimental results demonstrate that our method is simple and efficient, streamlining the adaptation process and achieving excellent performance at lower cost. Code is available at https://github.com/tobran/ONE-PIC.
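The two ideas named in the abstract, arranging source and target images into a single image and predicting the masked portion, can be illustrated with a minimal sketch. Everything specific here is an assumption for illustration and is not taken from the ONE-PIC code: the side-by-side layout, the function names `make_in_context_sample` and `masked_mse`, the fill value for masked pixels, and the use of a plain pixel-space mean squared error in place of a diffusion training objective.

```python
import numpy as np


def make_in_context_sample(source, target, mask_value=0.0):
    """Place source and target side by side and mask the target half.

    Hypothetical helper: the side-by-side layout is an assumed arrangement,
    not necessarily the one used by ONE-PIC.
    """
    if source.shape != target.shape:
        raise ValueError("source and target must have the same shape")
    h, w, _ = source.shape
    # Single canvas of shape (H, 2W, C): source on the left, target on the right.
    canvas = np.concatenate([source, target], axis=1)
    # Boolean mask marking the region the model is asked to predict.
    mask = np.zeros((h, 2 * w), dtype=bool)
    mask[:, w:] = True
    # Model input: the canvas with the target region blanked out.
    masked = canvas.copy()
    masked[mask] = mask_value
    return canvas, mask, masked


def masked_mse(pred, canvas, mask):
    """Mean squared error restricted to the masked region (assumed loss)."""
    diff = (pred - canvas) ** 2
    return float(diff[mask].mean())


if __name__ == "__main__":
    src = np.ones((4, 4, 3), dtype=np.float32)
    tgt = np.full((4, 4, 3), 0.5, dtype=np.float32)
    canvas, mask, masked = make_in_context_sample(src, tgt)
    print(canvas.shape, int(mask.sum()))
    print(masked_mse(masked, canvas, mask))
```

Under these assumptions, different tasks would differ only in what goes into the source and target slots, while the objective stays the same: reconstruct the masked region of one image.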

Parameter efficient finetuning of text-to-image models with trainable self-attention layer

StyleInject: Parameter Efficient Tuning of Text-to-Image Diffusion Models

Prior Preserved Text-to-Image Personalization Without Image Regularization

IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models

PaRa: Personalizing Text-to-Image Diffusion via Parameter Rank Reduction

Key-Locked Rank One Editing for Text-to-Image Personalization

Discriminative Probing and Tuning for Text-to-Image Generation

T2I-Adapter: Learning Adapters to Dig out More Controllable Ability for Text-to-Image Diffusion Models

Customization Assistant for Text-to-image Generation

Towards Understanding Cross and Self-Attention in Stable Diffusion for Text-Guided Image Editing

SaRA: High-Efficient Diffusion Model Fine-tuning with Progressive Sparse Low-Rank Adaptation

Attention Calibration for Disentangled Text-to-Image Personalization

Direct Consistency Optimization for Robust Customization of Text-to-Image Diffusion Models

Information Theoretic Text-to-Image Alignment

Unlocking the Potential of Text-to-Image Diffusion with PAC-Bayesian Theory

Training-Free Sketch-Guided Diffusion with Latent Optimization

Do We Need to Design Specific Diffusion Models for Different Tasks? Try ONE-PIC

Prompt Tuning Inversion for Text-Driven Image Editing Using Diffusion Models

Prompt-Free Diffusion: Taking "text" out of Text-to-Image Diffusion Models

Tuning-Free Image Customization with Image and Text Guidance

APrompt: Attention Prompt Tuning for Efficient Adaptation of Pre-trained Language Models