Dynamic Diffusion Transformer

Wangbo Zhao,Yizeng Han,Jiasheng Tang,Kai Wang,Yibing Song,Gao Huang,Fan Wang,Yang You
2024-10-09
Abstract: Diffusion Transformer (DiT), an emerging diffusion model for image generation, has demonstrated superior performance but suffers from substantial computational costs. Our investigations reveal that these costs stem from the static inference paradigm, which inevitably introduces redundant computation in certain diffusion timesteps and spatial regions. To address this inefficiency, we propose Dynamic Diffusion Transformer (DyDiT), an architecture that dynamically adjusts its computation along both timestep and spatial dimensions during generation. Specifically, we introduce a Timestep-wise Dynamic Width (TDW) approach that adapts model width conditioned on the generation timesteps. In addition, we design a Spatial-wise Dynamic Token (SDT) strategy to avoid redundant computation at unnecessary spatial locations. Extensive experiments on various datasets and different-sized models verify the superiority of DyDiT. Notably, with <3% additional fine-tuning iterations, our method reduces the FLOPs of DiT-XL by 51%, accelerates generation by 1.73×, and achieves a competitive FID score of 2.07 on ImageNet. The code is publicly available at https://github.com/NUS-HPC-AI-Lab/Dynamic-Diffusion-Transformer.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses the computational inefficiency of diffusion models for image generation. Existing Diffusion Transformers (DiT) produce high-quality images, but their static inference paradigm spends the same amount of computation on every timestep and every spatial region, and much of that computation is redundant. The redundancy raises computational cost and lengthens generation time. To tackle it, the authors propose the Dynamic Diffusion Transformer (DyDiT), which adjusts computation dynamically along both the timestep and spatial dimensions. The specific methods are:

1. **Timestep-wise Dynamic Width (TDW)**:
   - Introduces a mechanism that adjusts the width of the attention and Multi-Layer Perceptron (MLP) blocks based on the current timestep.
   - By analyzing the loss differences at different timesteps, the authors find that the task is less complex at later timesteps, where a smaller model suffices, so running the full-width model there is redundant.

2. **Spatial-wise Dynamic Token (SDT)**:
   - Designs a strategy to identify image patches where noise prediction is relatively "easy" and lets these patches skip the computation-intensive MLP blocks.
   - By analyzing the loss values of different spatial regions, the authors find that the loss is lower in background regions and higher in main-object regions, which makes uniform computation across all patches inefficient.

With these two methods, DyDiT reduces the computational load and accelerates generation while maintaining the quality of the generated images. According to the abstract, it cuts the FLOPs of DiT-XL by 51% and reaches a competitive FID score of 2.07 on ImageNet with fewer than 3% additional fine-tuning iterations. The authors report that the advantage over static DiT holds across multiple datasets and model scales.
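
The two mechanisms summarized above can be illustrated with a small, dependency-free sketch. It is not the authors' implementation. The function names, the sigmoid routers, the hard 0.5 threshold, the ReLU activation, and the grouping of hidden channels are all assumptions made for illustration, and the attention-width part of TDW is omitted. The sketch only shows where computation is saved: a timestep-conditioned mask switches off groups of MLP hidden channels, and a per-token score decides which tokens run the MLP at all.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def linear(x, w):
    """Multiply vector x (length d_in) by matrix w (d_in rows, d_out columns)."""
    return [sum(xi * row[j] for xi, row in zip(x, w)) for j in range(len(w[0]))]


def tdw_group_mask(t_emb, w_router, threshold=0.5):
    """Timestep-wise width: one keep/drop decision per hidden-channel group.

    The decision depends only on the timestep embedding, so it is shared by
    every token at that timestep. (Hypothetical router, for illustration.)
    """
    return [sigmoid(s) >= threshold for s in linear(t_emb, w_router)]


def masked_mlp(x, w1, w2, group_mask, group_size):
    """Two-layer ReLU MLP that computes only the hidden units in active groups."""
    keep = [j for j in range(len(w1[0])) if group_mask[j // group_size]]
    hidden = {j: max(0.0, sum(xi * row[j] for xi, row in zip(x, w1))) for j in keep}
    return [sum(hidden[j] * w2[j][k] for j in keep) for k in range(len(w2[0]))]


def sdt_mlp_block(tokens, w_token_router, w1, w2, group_mask, group_size, threshold=0.5):
    """Spatial-wise token skipping: low-scoring tokens bypass the MLP.

    Returns the output tokens and the number of tokens that ran the MLP.
    """
    out, processed = [], 0
    for x in tokens:
        score = sigmoid(sum(xi * wi for xi, wi in zip(x, w_token_router)))
        if score >= threshold:
            y = masked_mlp(x, w1, w2, group_mask, group_size)
            out.append([xi + yi for xi, yi in zip(x, y)])  # residual connection
            processed += 1
        else:
            out.append(list(x))  # skipped token passes through unchanged
    return out, processed


# Toy usage with random weights; all sizes are arbitrary.
rng = random.Random(0)
dim, hidden_dim, group_size = 4, 8, 2
w1 = [[rng.uniform(-1, 1) for _ in range(hidden_dim)] for _ in range(dim)]
w2 = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(hidden_dim)]
w_width_router = [[rng.uniform(-1, 1) for _ in range(hidden_dim // group_size)]
                  for _ in range(dim)]
w_token_router = [rng.uniform(-1, 1) for _ in range(dim)]
t_emb = [rng.uniform(-1, 1) for _ in range(dim)]
tokens = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(6)]

mask = tdw_group_mask(t_emb, w_width_router)
out, processed = sdt_mlp_block(tokens, w_token_router, w1, w2, mask, group_size)
print(f"active groups: {sum(mask)}/{len(mask)}, tokens through MLP: {processed}/{len(tokens)}")
```

In this sketch the MLP cost scales roughly with (active groups) × (processed tokens), so the two savings multiply. A trained model would learn both routers. Here they are random and serve only to show the control flow.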