Abstract: Diffusion-based models have demonstrated impressive capabilities for text-to-image generation and are expected to support personalized applications of subject-driven generation, which require generating customized concepts from one or a few reference images. However, existing methods based on fine-tuning fail to balance the trade-off between subject learning and the maintenance of the generation capabilities of pretrained models. Moreover, other methods that utilize additional image encoders tend to lose important details of the subject due to encoding compression. To address these challenges, we propose DreamTuner, a novel method that injects reference information from coarse to fine to achieve subject-driven image generation more effectively. DreamTuner introduces a subject-encoder for coarse subject identity preservation, where the compressed general subject features are introduced through an attention layer before visual-text cross-attention. We then modify the self-attention layers within pretrained text-to-image models to self-subject-attention layers to refine the details of the target subject. The generated image queries detailed features from both the reference image and itself in self-subject-attention. It is worth emphasizing that self-subject-attention is an effective, elegant, and training-free method for maintaining the detailed features of customized subjects and can serve as a plug-and-play solution during inference. Finally, with additional subject-driven fine-tuning, DreamTuner achieves remarkable performance in subject-driven image generation, which can be controlled by text or other conditions such as pose. For further details, please visit the project page at https://dreamtuner-diffusion.github.io/.

What problem does this paper attempt to address?

The problem this paper attempts to solve is how to achieve high-quality subject-driven image generation from a single reference image. Specifically, the authors aim to generate high-fidelity images that are consistent with text or other conditions (such as pose) while preserving the identity characteristics of the subject. Existing methods either struggle to balance subject learning against generation ability during fine-tuning, or lose important details of the subject when using an additional image encoder. The paper therefore proposes a new method, DreamTuner, which aims to achieve subject-driven image generation more effectively by injecting reference information from coarse to fine.

### Main Problem Summary

1. **Subject-Driven Image Generation with a Single Reference Image**:
   - How to generate high-fidelity images in multiple different scenarios using only one reference image.
2. **Maintaining Subject Identity Characteristics**:
   - How to ensure that newly generated images retain the identity characteristics of the subject in the original reference image.
3. **Controlling the Consistency of Generated Content**:
   - How to make the generated images consistent with the input text or other conditions (such as pose).

### DreamTuner's Solutions

To address the above challenges, DreamTuner proposes the following key techniques:

1. **Subject-Encoder**:
   - It coarsely preserves the identity characteristics of the subject. It extracts compressed features of the reference image with a pre-trained CLIP image encoder and injects them into the text-to-image model through an attention layer.
2. **Self-Subject-Attention**:
   - The self-attention layers are modified into self-subject-attention layers, which further refine the identity characteristics of the subject.
   - A self-subject-attention layer draws detailed features from both the reference image and the generated image, and strengthens control through weight and mask strategies.
3. **Subject-Driven Fine-tuning**:
   - The model is fine-tuned for a small number of training steps to better retain the identity characteristics of the subject. This step combines the advantages of the subject-encoder and self-subject-attention and can achieve better results in less training time.

### Experimental Results

The experimental results show that DreamTuner performs well on multiple datasets and, in particular, outperforms existing methods at maintaining subject identity characteristics. For example, when generating images of static objects, animals, and anime characters, DreamTuner retains detailed characteristics of the subject, such as the text on a can, the white stripes on a dog, and the eyes and clothes of an anime character.

In conclusion, DreamTuner addresses the problem of high-quality subject-driven image generation using only a single reference image, while maintaining subject identity characteristics and text consistency in the generated images.
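The self-subject-attention idea described above, where the generated image queries features from both the reference image and itself, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The function name `self_subject_attention`, the single-head layout, the way `ref_weight` is applied (as a log-bias on the reference logits), and the meaning of `ref_mask` (True for reference tokens that belong to the subject) are all assumptions made for illustration; the summary only states that weight and mask strategies are used.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_subject_attention(gen, ref, w_q, w_k, w_v, ref_weight=1.0, ref_mask=None):
    """Single-head attention where generated tokens attend to themselves
    and to reference-image tokens (illustrative sketch, hypothetical API).

    gen: (n_gen, d) features of the image being generated
    ref: (n_ref, d) features of the reference image
    w_q, w_k, w_v: (d, d_head) projection matrices
    ref_weight: assumed scaling of attention paid to reference tokens
    ref_mask: optional (n_ref,) bool array, True = subject token (assumed)
    """
    n_gen, n_ref = gen.shape[0], ref.shape[0]

    # Queries come only from the generated image.
    q = gen @ w_q
    # Keys and values come from generated and reference tokens together.
    tokens = np.concatenate([gen, ref], axis=0)
    k = tokens @ w_k
    v = tokens @ w_v

    logits = q @ k.T / np.sqrt(k.shape[-1])

    # Assumed weight strategy: adding log(w) to a logit multiplies its
    # unnormalized attention weight by w; w = 0 disables the reference.
    bias = np.full(n_ref, np.log(ref_weight) if ref_weight > 0 else -np.inf)
    # Assumed mask strategy: ignore reference tokens outside the subject.
    if ref_mask is not None:
        bias[~np.asarray(ref_mask, dtype=bool)] = -np.inf
    logits[:, n_gen:] += bias

    attn = softmax(logits, axis=-1)
    return attn @ v, attn


rng = np.random.default_rng(0)
gen = rng.normal(size=(5, 8))   # 5 generated-image tokens
ref = rng.normal(size=(3, 8))   # 3 reference-image tokens
w_q, w_k, w_v = (rng.normal(size=(8, 4)) for _ in range(3))
out, attn = self_subject_attention(gen, ref, w_q, w_k, w_v, ref_weight=2.0)
print(out.shape, attn.shape)
```

Because the layer only changes which keys and values the existing projections see, a sketch like this adds no new parameters, which is consistent with the abstract's description of self-subject-attention as training-free and plug-and-play at inference time. Setting `ref_weight=0.0` in this sketch reduces it to ordinary self-attention over the generated tokens.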

DreamTuner: Single Image is Enough for Subject-Driven Generation

DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation

HiFi Tuner: High-Fidelity Subject-Driven Fine-Tuning for Diffusion Models

DreamArtist: Towards Controllable One-Shot Text-to-Image Generation via Positive-Negative Prompt-Tuning

DreamBlend: Advancing Personalized Fine-tuning of Text-to-Image Diffusion Models

DisenDreamer: Subject-Driven Text-to-Image Generation with Sample-aware Disentangled Tuning

Subject-driven Text-to-Image Generation via Apprenticeship Learning

FastComposer: Tuning-Free Multi-Subject Image Generation with Localized Attention

VideoDreamer: Customized Multi-Subject Text-to-Video Generation with Disen-Mix Finetuning

DreamEdit: Subject-driven Image Editing

FreeTuner: Any Subject in Any Style with Training-free Diffusion

Taming Encoder for Zero Fine-tuning Image Customization with Text-to-Image Diffusion Models

Improving Subject-Driven Image Synthesis with Subject-Agnostic Guidance

Subject-driven Text-to-Image Generation via Preference-based Reinforcement Learning

Resolving Multi-Condition Confusion for Finetuning-Free Personalized Image Generation

DreamVideo: High-Fidelity Image-to-Video Generation with Image Retention and Text Guidance

FreeCustom: Tuning-Free Customized Image Generation for Multi-Concept Composition

EZIGen: Enhancing zero-shot personalized image generation with precise subject encoding and decoupled guidance

AttnDreamBooth: Towards Text-Aligned Personalized Text-to-Image Generation

DreamVideo: Composing Your Dream Videos with Customized Subject and Motion