SCEdit: Efficient and Controllable Image Diffusion Generation via Skip Connection Editing

Zeyinzi Jiang, Chaojie Mao, Yulin Pan, Zhen Han, Jingfeng Zhang

2023-12-19

Abstract: Image diffusion models have been utilized in various tasks, such as text-to-image generation and controllable image synthesis. Recent research has introduced tuning methods that make subtle adjustments to the original models, yielding promising results in specific adaptations of foundational generative diffusion models. Rather than modifying the main backbone of the diffusion model, we delve into the role of skip connection in U-Net and reveal that hierarchical features aggregating long-distance information across encoder and decoder make a significant impact on the content and quality of image generation. Based on the observation, we propose an efficient generative tuning framework, dubbed SCEdit, which integrates and edits Skip Connection using a lightweight tuning module named SC-Tuner. Furthermore, the proposed framework allows for straightforward extension to controllable image synthesis by injecting different conditions with Controllable SC-Tuner, simplifying and unifying the network design for multi-condition inputs. Our SCEdit substantially reduces training parameters, memory usage, and computational expense due to its lightweight tuners, with backward propagation only passing to the decoder blocks. Extensive experiments conducted on text-to-image generation and controllable image synthesis tasks demonstrate the superiority of our method in terms of efficiency and performance. Project page: https://scedit.github.io/

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

The paper aims to address the following issues:

1. **Efficient Fine-Tuning**: Full fine-tuning of existing image diffusion models is often inefficient or impractical for specific tasks, especially in custom scenarios, because training data and computational resources are limited. Researchers have therefore proposed efficient fine-tuning methods to solve this problem.
2. **Controllable Image Generation**: Although existing methods can achieve a certain degree of controllable image generation, they still consume a lot of resources and are not efficient when scaling the network. For example, the LoRA method, while effective, introduces trainable low-rank matrices throughout the U-Net, leading to increased gradient accumulation and memory usage during training.

To address these issues, the paper proposes a framework called SCEdit, which achieves efficient and controllable image generation by editing the skip connections in the U-Net. Specifically, SCEdit introduces a lightweight fine-tuning module called SC-Tuner, which edits the latent features in each skip connection of the pre-trained U-Net, enabling efficient fine-tuning. An extension of SC-Tuner additionally supports controllable image generation with multiple input conditions. SCEdit demonstrates superior performance and efficiency in both text-to-image generation and controllable image synthesis tasks.
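To make the idea of editing a skip connection concrete, here is a minimal sketch of a residual tuner applied to a single skip feature. It is an illustration under stated assumptions, not the authors' implementation: the class name `SCTunerSketch`, the down-projection/ReLU/up-projection layout, the zero initialization, and the way the optional `cond` feature is added are all hypothetical choices. NumPy is used only to keep the example dependency-light.

```python
import numpy as np


class SCTunerSketch:
    """Hypothetical lightweight tuner for one U-Net skip connection.

    Assumed layout: down-projection, ReLU, up-projection, plus a residual
    path so the frozen skip feature is edited rather than replaced.
    """

    def __init__(self, channels, hidden, seed=0):
        rng = np.random.default_rng(seed)
        # Small random down-projection (channels -> hidden).
        self.w_down = rng.normal(0.0, 0.02, size=(channels, hidden))
        # Zero-initialized up-projection: the tuner starts as the identity,
        # so the pre-trained model's behavior is unchanged before training.
        self.w_up = np.zeros((hidden, channels))

    def __call__(self, skip, cond=None):
        # skip: (tokens, channels) feature carried by the skip connection.
        # cond: optional condition feature of the same shape (assumption:
        # it is simply added to the tuner input).
        x = skip if cond is None else skip + cond
        h = np.maximum(x @ self.w_down, 0.0)  # ReLU bottleneck
        return skip + h @ self.w_up           # residual edit of the skip feature

    def num_params(self):
        # Only these two matrices would be trained; the U-Net stays frozen.
        return self.w_down.size + self.w_up.size


if __name__ == "__main__":
    skip = np.ones((16, 8))
    tuner = SCTunerSketch(channels=8, hidden=2)
    print(np.allclose(tuner(skip), skip))  # identity at initialization
    print(tuner.num_params())
```

Because the tuner sits on the skip path that feeds the decoder, gradients for its weights do not need to flow back through the encoder, which is the source of the memory savings the summary describes.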

eDiff-I: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers

ECNet: Effective Controllable Text-to-Image Diffusion Models

CCEdit: Creative and Controllable Video Editing via Diffusion Models

SSIE-Diffusion: Personalized Generative Model for Subject-Specific Image Editing

Accelerating Vision Diffusion Transformers with Skip Branches

ControlNet-XS: Designing an Efficient and Effective Architecture for Controlling Text-to-Image Diffusion Models

TextCraftor: Your Text Encoder Can be Image Quality Controller

UDiffText: A Unified Framework for High-quality Text Synthesis in Arbitrary Images via Character-aware Diffusion Models

Taming Encoder for Zero Fine-tuning Image Customization with Text-to-Image Diffusion Models

Semantic-Conditional Diffusion Networks for Image Captioning

Scene Diffusion: Text-driven Scene Image Synthesis Conditioning on a Single 3D Model

Improving Diffusion Models for Scene Text Editing with Dual Encoders

CreativeSynth: Creative Blending and Synthesis of Visual Arts based on Multimodal Diffusion

ControlNet-XS: Rethinking the Control of Text-to-Image Diffusion Models as Feedback-Control Systems

Accelerating Text-to-Image Editing via Cache-Enabled Sparse Diffusion Inference

The Surprising Effectiveness of Skip-Tuning in Diffusion Sampling

SDEdit: Guided Image Synthesis and Editing with Stochastic Differential Equations

ACE: All-round Creator and Editor Following Instructions via Diffusion Transformer

CogView3: Finer and Faster Text-to-Image Generation via Relay Diffusion