FlexDiT: Dynamic Token Density Control for Diffusion Transformer

Shuning Chang, Pichao Wang, Jiasheng Tang, Yi Yang
2024-12-09
Abstract: Diffusion Transformers (DiT) deliver impressive generative performance but face prohibitive computational demands due to both the quadratic complexity of token-based self-attention and the need for extensive sampling steps. While recent research has focused on accelerating sampling, the structural inefficiencies of DiT remain underexplored. We propose FlexDiT, a framework that dynamically adapts token density across both spatial and temporal dimensions to achieve computational efficiency without compromising generation quality. Spatially, FlexDiT employs a three-segment architecture that allocates token density based on feature requirements at each layer: Poolingformer in the bottom layers for efficient global feature extraction, Sparse-Dense Token Modules (SDTM) in the middle layers to balance global context with local detail, and dense tokens in the top layers to refine high-frequency details. Temporally, FlexDiT dynamically modulates token density across denoising stages, progressively increasing token count as finer details emerge in later timesteps. This synergy between FlexDiT's spatially adaptive architecture and its temporal pruning strategy enables a unified framework that balances efficiency and fidelity throughout the generation process. Our experiments demonstrate FlexDiT's effectiveness, achieving a 55% reduction in FLOPs and a 175% improvement in inference speed on DiT-XL with only a 0.09 increase in FID score on 512$\times$512 ImageNet images, a 56% reduction in FLOPs across video generation datasets including FaceForensics, SkyTimelapse, UCF101, and Taichi-HD, and a 69% improvement in inference speed on PixArt-$\alpha$ on the text-to-image generation task with a 0.24 FID score decrease. FlexDiT provides a scalable solution for high-quality diffusion-based generation compatible with further sampling optimization techniques.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses the computational cost of the Diffusion Transformer (DiT) in generation tasks. DiT produces high-quality samples, but its token-based self-attention has quadratic complexity in the number of tokens, and sampling requires many denoising steps. Together these make DiT a computational bottleneck in practice, especially where efficiency matters. The paper proposes FlexDiT, a framework that improves computational efficiency by dynamically adjusting token density without sacrificing generation quality. The specific problems it targets, and how it addresses them, are:

1. **High computational complexity**: DiT relies on token-level self-attention, whose cost grows quadratically with the number of tokens and also rises with model scale. FlexDiT alleviates this by introducing dynamic token density control in both the spatial and temporal dimensions.
2. **Many sampling steps**: DiT needs a large number of sampling steps for denoising, which multiplies the per-step cost. Whereas recent work has focused mainly on accelerating sampling, FlexDiT targets the structural inefficiency of DiT itself, combining spatial and temporal token density control, and remains compatible with further sampling optimization techniques.
3. **Imbalance in global and local feature extraction**: Across layers, DiT alternates between attending to global and local features, a pattern that had not been fully exploited. FlexDiT uses a three-segment architecture: Poolingformer in the bottom layers for efficient global feature extraction, Sparse-Dense Token Modules (SDTM) in the middle layers to balance global context with local detail, and dense tokens in the top layers to refine high-frequency details.
4. **Token density management across space and time**: FlexDiT adjusts token density dynamically in both dimensions. In early denoising stages the model uses fewer tokens to capture low-frequency global structure; as denoising progresses, the token count is gradually increased to capture high-frequency local details.

With these changes, FlexDiT reduces FLOPs (floating-point operations) and increases inference speed while keeping the FID (Fréchet Inception Distance) score close to the baseline. The abstract reports a 55% reduction in FLOPs and a 175% improvement in inference speed on DiT-XL with only a 0.09 increase in FID on 512$\times$512 ImageNet images, a 56% reduction in FLOPs across video generation datasets (FaceForensics, SkyTimelapse, UCF101, and Taichi-HD), and a 69% improvement in inference speed on PixArt-$\alpha$ with a 0.24 decrease in FID. In summary, the paper aims to make DiT substantially cheaper to run while preserving generation quality through token density management, thereby widening its range of practical applications.
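The quadratic attention cost and the coarse-to-fine token schedule described above can be illustrated with a small back-of-the-envelope sketch. It is not the authors' implementation: the function names, the linear schedule, and the example sizes (1024 tokens, hidden width 1152, 50 steps, a floor of 256 tokens) are all assumptions chosen for illustration, and the cost model counts only the multiply-accumulates of a single self-attention layer.

```python
def projection_macs(num_tokens: int, dim: int) -> int:
    """Multiply-accumulates for the Q, K, V and output projections (linear in tokens)."""
    return 4 * num_tokens * dim * dim


def pairwise_macs(num_tokens: int, dim: int) -> int:
    """Multiply-accumulates for QK^T and the attention-weighted sum of V (quadratic in tokens)."""
    return 2 * num_tokens * num_tokens * dim


def attention_macs(num_tokens: int, dim: int) -> int:
    """Approximate cost of one self-attention layer under this simplified model."""
    return projection_macs(num_tokens, dim) + pairwise_macs(num_tokens, dim)


def token_schedule(step: int, total_steps: int, min_tokens: int, max_tokens: int) -> int:
    """Hypothetical linear schedule (an assumption, not the paper's rule).

    step 0 is the first, noisiest denoising step and uses min_tokens;
    the last step uses max_tokens.
    """
    if total_steps < 2:
        return max_tokens
    frac = step / (total_steps - 1)
    return round(min_tokens + frac * (max_tokens - min_tokens))


if __name__ == "__main__":
    dim, max_tokens, min_tokens, total_steps = 1152, 1024, 256, 50  # illustrative values
    dense = attention_macs(max_tokens, dim)
    for step in (0, total_steps // 2, total_steps - 1):
        n = token_schedule(step, total_steps, min_tokens, max_tokens)
        ratio = attention_macs(n, dim) / dense
        print(f"step {step:2d}: {n:4d} tokens, {ratio:.2f}x the dense attention cost")
```

Under this model, halving the token count halves the projection cost but quarters the pairwise term, which is why reducing token density in early denoising steps can save more than a proportional share of attention compute.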