Abstract: Diffusion-based models have demonstrated impressive capabilities for text-to-image generation and are promising for personalized applications of subject-driven generation, which require generating customized concepts from one or a few reference images. However, existing methods based on fine-tuning fail to balance the trade-off between subject learning and preserving the generation capabilities of pretrained models. Moreover, methods that rely on additional image encoders tend to lose important details of the subject because of encoding compression. To address these challenges, we propose DreamTuner, a novel method that injects reference information from coarse to fine to achieve subject-driven image generation more effectively. DreamTuner introduces a subject-encoder for coarse subject identity preservation, where the compressed general subject features are introduced through an attention layer placed before the visual-text cross-attention. We then modify the self-attention layers within pretrained text-to-image models into self-subject-attention layers to refine the details of the target subject. In self-subject-attention, the generated image queries detailed features from both the reference image and itself. It is worth emphasizing that self-subject-attention is an effective, elegant, and training-free method for maintaining the detailed features of customized subjects and can serve as a plug-and-play solution during inference. Finally, with additional subject-driven fine-tuning, DreamTuner achieves remarkable performance in subject-driven image generation, which can be controlled by text or by other conditions such as pose. For further details, please visit the project page at https://dreamtuner-diffusion.github.io/.
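The self-subject-attention idea described in the abstract, where the generated image queries features from both the reference image and itself, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. It assumes the simplest reading of the abstract: queries come from the generated image's features, while keys and values come from the generated and reference features concatenated along the token axis. The function name, argument names, and projection matrices below are hypothetical, and the sketch omits multi-head splitting, masking, and any weighting of the reference tokens.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_subject_attention(gen_feats, ref_feats, w_q, w_k, w_v):
    """Single-head attention where generated tokens attend to
    themselves and to reference-image tokens (illustrative only).

    gen_feats: (n_gen, d) features of the image being generated
    ref_feats: (n_ref, d) features of the reference image
    w_q, w_k, w_v: (d, d) projection matrices; in a training-free
        setting these would be the pretrained self-attention weights
        (an assumption of this sketch).
    """
    # Queries come only from the generated image.
    q = gen_feats @ w_q

    # Keys and values come from both the generated and reference tokens.
    kv_source = np.concatenate([gen_feats, ref_feats], axis=0)
    k = kv_source @ w_k
    v = kv_source @ w_v

    # Scaled dot-product attention over the joint token set.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    attn = softmax(scores, axis=-1)
    return attn @ v, attn


# Hypothetical usage with random features and weights.
rng = np.random.default_rng(0)
d = 8
gen = rng.normal(size=(4, d))   # 4 generated-image tokens
ref = rng.normal(size=(6, d))   # 6 reference-image tokens
w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))
out, attn = self_subject_attention(gen, ref, w_q, w_k, w_v)
```

Because the pretrained query, key, and value projections are reused and only the key/value token set is extended, a layer of this shape needs no extra trained parameters, which is consistent with the abstract's description of the mechanism as training-free and plug-and-play at inference.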

SSR-Encoder: Encoding Selective Subject Representation for Subject-Driven Generation

EZIGen: Enhancing zero-shot personalized image generation with precise subject encoding and decoupled guidance

DreamTuner: Single Image is Enough for Subject-Driven Generation

Single Remote Sensing Image Super-Resolution Via a Generative Adversarial Network with Stratified Dense Sampling and Chain Training

SSR: SAM is a Strong Regularizer for domain adaptive semantic segmentation

Taming Encoder for Zero Fine-tuning Image Customization with Text-to-Image Diffusion Models

DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation

SSR-Speech: Towards Stable, Safe and Robust Zero-shot Text-based Speech Editing and Synthesis

FastComposer: Tuning-Free Multi-Subject Image Generation with Localized Attention

Improving Subject-Driven Image Synthesis with Subject-Agnostic Guidance

DisenDreamer: Subject-Driven Text-to-Image Generation with Sample-aware Disentangled Tuning

Subject-driven Text-to-Image Generation via Preference-based Reinforcement Learning

HybridBooth: Hybrid Prompt Inversion for Efficient Subject-Driven Generation

Large-Scale Text-to-Image Model with Inpainting is a Zero-Shot Subject-Driven Image Generator

Subject-driven Text-to-Image Generation via Apprenticeship Learning

ELITE: Encoding Visual Concepts into Textual Embeddings for Customized Text-to-Image Generation

DisEnvisioner: Disentangled and Enriched Visual Prompt for Customized Image Generation

Semantic Image Synthesis with Unconditional Generator

EGDSR: Encoder-Generator-Decoder Network for Remote Sensing Super-Resolution Reconstruction