Abstract: Diffusion models are trained by learning a sequence of models that reverse each step of noise corruption. Typically, the model parameters are fully shared across multiple timesteps to enhance training efficiency. However, since the denoising tasks differ at each timestep, the gradients computed at different timesteps may conflict, potentially degrading the overall performance of image generation. To solve this issue, this work proposes a Decouple-then-Merge (DeMe) framework, which begins with a pretrained model and finetunes separate models tailored to specific timesteps. We introduce several improved techniques during the finetuning stage to promote effective knowledge sharing while minimizing training interference across timesteps. Finally, after finetuning, these separate models can be merged into a single model in the parameter space, ensuring efficient and practical inference. Experimental results show significant generation quality improvements on six benchmarks: Stable Diffusion on COCO30K, ImageNet1K, and PartiPrompts, and DDPM on LSUN Church, LSUN Bedroom, and CIFAR10.
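One way to make "gradients computed at different timesteps may conflict" concrete is to measure the cosine similarity between per-timestep gradients of a single shared set of parameters: a negative value means the two timesteps push the shared weights in opposing directions. The sketch below is a toy illustration under stated assumptions, not the paper's measurement protocol. The denoiser is a hypothetical diagonal linear noise predictor `eps_hat = w * x_t`, the data are zero-mean with made-up per-dimension variances, and the variance-preserving forward process `x_t = sqrt(alpha_bar) * x_0 + sqrt(1 - alpha_bar) * eps` is assumed so that the expected gradient has a closed form.

```python
import math


def expected_grad(w, data_var, alpha_bar):
    """Expected gradient of sum_k E[(w_k * x_t,k - eps_k)^2] w.r.t. w.

    Toy assumptions: zero-mean data x_0 with per-dimension variance
    data_var[k], eps ~ N(0, 1) independent of x_0, and the VP forward
    process x_t = sqrt(alpha_bar) * x_0 + sqrt(1 - alpha_bar) * eps.
    """
    s = math.sqrt(1.0 - alpha_bar)
    grads = []
    for w_k, var_k in zip(w, data_var):
        # E[x_t^2] = alpha_bar * var_k + (1 - alpha_bar) and E[x_t * eps] = s,
        # so d/dw_k E[(w_k x_t - eps)^2] = 2 w_k E[x_t^2] - 2 E[x_t eps].
        second_moment = alpha_bar * var_k + (1.0 - alpha_bar)
        grads.append(2.0 * w_k * second_moment - 2.0 * s)
    return grads


def cosine(u, v):
    """Cosine similarity; negative values indicate conflicting directions."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    if norm == 0.0:
        return 0.0  # treat a zero gradient as non-conflicting
    return dot / norm


def conflict_matrix(w, data_var, alpha_bars):
    """Pairwise gradient cosine similarities across the given timesteps."""
    grads = [expected_grad(w, data_var, ab) for ab in alpha_bars]
    return [[cosine(g_i, g_j) for g_j in grads] for g_i in grads]


if __name__ == "__main__":
    w = [0.5, 0.5]                  # shared parameters (toy values)
    data_var = [1.0, 16.0]          # hypothetical per-dimension data variance
    alpha_bars = [0.99, 0.5, 0.01]  # low-, mid-, and high-noise timesteps
    for row in conflict_matrix(w, data_var, alpha_bars):
        print(["%+.2f" % c for c in row])
```

With these toy values, the low-noise and mid-noise gradients are nearly aligned, while the high-noise gradient has a negative cosine with both of the others, so a single shared `w` must compromise between opposing updates.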

### What problem does this paper attempt to solve?

This paper addresses a key problem in training diffusion models: **gradient conflicts between the denoising tasks at different timesteps**. Specifically:

1. **Problem background**:
   - Diffusion models generate images by learning a sequence of models, each of which reverses one step of noise corruption.
   - Model parameters are usually shared across all timesteps to improve training efficiency.
   - However, because the denoising task differs at each timestep, the gradients computed at different timesteps may conflict, which can reduce the overall quality of the generated images.
2. **Specific problems**:
   - **Gradient conflict**: Gradients at different timesteps differ substantially, especially between non-adjacent timesteps, which indicates conflicting optimization directions (as shown in Figure 1(a)).
   - **Performance degradation**: These conflicts can make training harder to converge and thereby lower the quality of the generated images.
3. **Solution**:
   - The paper proposes a framework named **Decouple-then-Merge (DeMe)** to address these problems.
   - **Decoupling stage**: Starting from a pretrained model, finetune separate models for different timestep ranges, so that training on one range does not suffer gradient conflicts with the others.
   - **Merging stage**: Merge the finetuned models into a single model in parameter space, which keeps inference efficient and shares knowledge across timesteps.
4. **Innovations**:
   - **Reduced gradient conflict**: Decoupling training across timesteps reduces gradient conflicts and improves generation quality.
   - **No additional inference overhead**: The merged model incurs no additional computation, storage, or memory-access cost during inference.
   - **Significantly improved generation quality**: Experiments on multiple benchmark datasets show that DeMe significantly improves the quality of the generated images.

With this approach, the paper mitigates the gradient conflict problem in diffusion model training and significantly improves image generation quality.
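The decoupling and merging stages described above can be sketched as follows. This is a minimal illustration under assumptions, not the authors' implementation. `timestep_range` assigns each sampled timestep to one of K equal-width ranges (the paper's actual partition may differ). `merge_task_vectors` uses task-vector arithmetic, `merged = pretrained + sum_k lambda_k * (finetuned_k - pretrained)`, with uniform weights `lambda_k = 1/K` by default, as one common way to merge models in parameter space. Both function names are hypothetical, and parameters are plain floats keyed by name; in a real network each entry would be a tensor.

```python
def timestep_range(t, num_timesteps, num_ranges):
    """Decoupling step: map timestep t in [0, num_timesteps) to a range index.

    Assumes equal-width ranges. During finetuning, a sample drawn at
    timestep t would only update the model copy that owns this range.
    """
    if not 0 <= t < num_timesteps:
        raise ValueError(f"timestep {t} outside [0, {num_timesteps})")
    return t * num_ranges // num_timesteps


def merge_task_vectors(pretrained, finetuned, weights=None):
    """Merging step: combine per-range finetuned models into one parameter set.

    merged[name] = pretrained[name]
                   + sum_k weights[k] * (finetuned[k][name] - pretrained[name])
    With uniform weights this equals the plain average of the finetuned models.
    """
    num_models = len(finetuned)
    if weights is None:
        weights = [1.0 / num_models] * num_models  # hypothetical default: uniform
    if len(weights) != num_models:
        raise ValueError("need one weight per finetuned model")
    merged = {}
    for name, base in pretrained.items():
        # Task vector of model k for this parameter: finetuned[k][name] - base
        delta = sum(w * (ft[name] - base) for w, ft in zip(weights, finetuned))
        merged[name] = base + delta
    return merged


if __name__ == "__main__":
    T, K = 1000, 4
    print([timestep_range(t, T, K) for t in (0, 249, 250, 999)])  # -> [0, 0, 1, 3]

    pretrained = {"w": 0.5, "b": 0.0}
    # Hypothetical results of finetuning one copy per timestep range.
    finetuned = [{"w": 0.8, "b": 0.2}, {"w": 0.6, "b": -0.2}]
    print(merge_task_vectors(pretrained, finetuned))
```

Uniform weights reduce to simple averaging, while non-uniform `lambda_k` let some timestep ranges contribute more to the merged model. This summary does not say how DeMe chooses its merging weights, so treat them here as a tunable assumption.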

Decouple-Then-Merge: Towards Better Training for Diffusion Models

Improving Efficiency of Diffusion Models via Multi-Stage Framework and Tailored Multi-Decoder Architectures

FRDiff: Feature Reuse for Universal Training-free Acceleration of Diffusion Models

Immiscible Diffusion: Accelerating Diffusion Training with Noise Assignment

Training Diffusion Models with Federated Learning

ReDiFine: Reusable Diffusion Finetuning for Mitigating Degradation in the Chain of Diffusion

Residual Denoising Diffusion Models

Collaborative Diffusion for Multi-Modal Face Generation and Editing

Faster Diffusion: Rethinking the Role of the Encoder for Diffusion Model Inference

Diffusion Tuning: Transferring Diffusion Models via Chain of Forgetting

Stimulating the Diffusion Model for Image Denoising Via Adaptive Embedding and Ensembling

High-Fidelity Diffusion-based Image Editing

Attention-Driven Training-Free Efficiency Enhancement of Diffusion Models

DDRF: Denoising Diffusion Model for Remote Sensing Image Fusion

DuoDiff: Accelerating Diffusion Models with a Dual-Backbone Approach

One More Step: A Versatile Plug-and-Play Module for Rectifying Diffusion Schedule Flaws and Enhancing Low-Frequency Controls

Towards Faster Training of Diffusion Models: An Inspiration of A Consistency Phenomenon

Towards More Accurate Diffusion Model Acceleration with A Timestep Aligner

PartDiff: Image Super-resolution with Partial Diffusion Models