Abstract: Diffusion Transformer (DiT), an emerging diffusion model for image generation, has demonstrated superior performance but suffers from substantial computational costs. Our investigations reveal that these costs stem from the static inference paradigm, which inevitably introduces redundant computation in certain diffusion timesteps and spatial regions. To address this inefficiency, we propose Dynamic Diffusion Transformer (DyDiT), an architecture that dynamically adjusts its computation along both timestep and spatial dimensions during generation. Specifically, we introduce a Timestep-wise Dynamic Width (TDW) approach that adapts model width conditioned on the generation timesteps. In addition, we design a Spatial-wise Dynamic Token (SDT) strategy to avoid redundant computation at unnecessary spatial locations. Extensive experiments on various datasets and different-sized models verify the superiority of DyDiT. Notably, with <3% additional fine-tuning iterations, our method reduces the FLOPs of DiT-XL by 51%, accelerates generation by 1.73×, and achieves a competitive FID score of 2.07 on ImageNet. The code is publicly available at https://github.com/NUS-HPC-AI-Lab/Dynamic-Diffusion-Transformer.

What problem does this paper attempt to address?

The paper addresses the computational inefficiency of diffusion models for image generation. Existing Diffusion Transformers (DiT) generate high-quality images but perform a large amount of redundant computation during inference, particularly across different timesteps and spatial regions. This redundancy raises computational cost and lengthens generation time. To tackle this challenge, the authors propose the Dynamic Diffusion Transformer (DyDiT), which improves efficiency by dynamically allocating computation along the timestep and spatial dimensions. The specific methods are:

1. **Timestep-wise Dynamic Width (TDW)**:
   - Introduces a mechanism that lets the model adjust the width of its attention and Multi-Layer Perceptron (MLP) blocks based on the current timestep.
   - By analyzing the loss differences at different timesteps, the authors find that the task complexity is lower at later timesteps, which can therefore be handled by a smaller model, reducing redundant computation.
2. **Spatial-wise Dynamic Token (SDT)**:
   - Designs a strategy to identify image patches where noise prediction is relatively "easy" and lets these patches skip the computation-intensive MLP blocks, reducing unnecessary computation.
   - By analyzing the loss values of different spatial regions, the authors find that the loss is lower in background regions and higher in main object regions, which makes uniform computation across all patches inefficient.

With these methods, DyDiT significantly reduces the computational load and accelerates generation while maintaining the quality of the generated images. Experimental results show that, compared to the static DiT, DyDiT achieves competitive FID scores at a lower computational cost, and its advantage is validated across multiple datasets and model scales.
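The two routing ideas described above can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names, the fixed 0.5 threshold, the ReLU activation, and the random weights standing in for learned routers are all assumptions made for illustration. It only shows the general shape of the technique: a timestep-conditioned mask over attention heads (TDW-like), and a per-token gate that lets "easy" tokens bypass the MLP through the residual path (SDT-like).

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def timestep_head_mask(t_emb, w_router, threshold=0.5):
    """TDW-like router (hypothetical): one on/off decision per attention head,
    computed from the timestep embedding alone, so it is shared by all tokens."""
    scores = sigmoid(t_emb @ w_router)        # shape: (num_heads,)
    return scores >= threshold                # boolean mask over heads

def token_skip_mlp(x, w_token, w1, w2, threshold=0.5):
    """SDT-like routing (hypothetical): tokens whose router score falls below
    the threshold skip the MLP and pass through unchanged."""
    scores = sigmoid(x @ w_token)             # shape: (num_tokens,)
    keep = scores >= threshold                # tokens that go through the MLP
    out = x.copy()                            # skipped tokens keep their input value
    if keep.any():
        hidden = np.maximum(x[keep] @ w1, 0.0)    # ReLU chosen for brevity
        out[keep] = x[keep] + hidden @ w2         # residual connection
    return out, keep

# Toy sizes; random weights stand in for trained router and MLP parameters.
dim, hidden_dim, num_heads, num_tokens = 8, 32, 4, 6
t_emb = rng.normal(size=dim)
w_router = rng.normal(size=(dim, num_heads))
w_token = rng.normal(size=dim)
w1 = rng.normal(size=(dim, hidden_dim))
w2 = rng.normal(size=(hidden_dim, dim))
x = rng.normal(size=(num_tokens, dim))

head_mask = timestep_head_mask(t_emb, w_router)
out, keep = token_skip_mlp(x, w_token, w1, w2)

# Fraction of MLP token computations actually performed in this toy example.
mlp_fraction = keep.sum() / num_tokens
print("active heads:", head_mask)
print("tokens processed by MLP:", keep, "fraction:", mlp_fraction)
```

The saving in this sketch comes from computing the MLP only on the selected rows of `x`; in a trained model the routers would be learned jointly with the network rather than sampled at random.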
