Abstract: Diffusion Transformers (DiT) deliver impressive generative performance but face prohibitive computational demands due to both the quadratic complexity of token-based self-attention and the need for extensive sampling steps. While recent research has focused on accelerating sampling, the structural inefficiencies of DiT remain underexplored. We propose FlexDiT, a framework that dynamically adapts token density across both spatial and temporal dimensions to achieve computational efficiency without compromising generation quality. Spatially, FlexDiT employs a three-segment architecture that allocates token density based on feature requirements at each layer: Poolingformer in the bottom layers for efficient global feature extraction, Sparse-Dense Token Modules (SDTM) in the middle layers to balance global context with local detail, and dense tokens in the top layers to refine high-frequency details. Temporally, FlexDiT dynamically modulates token density across denoising stages, progressively increasing token count as finer details emerge in later timesteps. This synergy between FlexDiT's spatially adaptive architecture and its temporal pruning strategy enables a unified framework that balances efficiency and fidelity throughout the generation process. Our experiments demonstrate FlexDiT's effectiveness, achieving a 55% reduction in FLOPs and a 175% improvement in inference speed on DiT-XL with only a 0.09 increase in FID score on 512$\times$512 ImageNet images, a 56% reduction in FLOPs across video generation datasets including FaceForensics, SkyTimelapse, UCF101, and Taichi-HD, and a 69% improvement in inference speed on PixArt-$\alpha$ on the text-to-image generation task with a 0.24 decrease in FID score. FlexDiT provides a scalable solution for high-quality diffusion-based generation that is compatible with further sampling optimization techniques.

What problem does this paper attempt to address?

This paper addresses the computational cost of Diffusion Transformer (DiT) models in generation tasks. DiT models generate high-quality outputs, but their token-based self-attention has quadratic complexity in the number of tokens, and denoising requires many sampling steps. Together these make DiT expensive to run, especially in efficiency-sensitive settings. The paper proposes the FlexDiT framework, which improves computational efficiency by dynamically adjusting token density without sacrificing generation quality. The specific problems it addresses, and how, are:

1. **High computational complexity**: DiT relies on token-level self-attention, whose cost rises sharply with the number of tokens and the model scale. FlexDiT alleviates this by introducing dynamic token density control in the spatial and temporal dimensions.
2. **Many sampling steps**: DiT requires a large number of sampling steps for the denoising process, which further increases the computational burden. Whereas recent work has focused on accelerating sampling, FlexDiT targets the inefficiency of the DiT structure itself, with a solution that combines the spatial and temporal dimensions and remains compatible with sampling optimization techniques.
3. **Imbalance in global and local feature extraction**: Across its layers, DiT alternates between focusing on global and local features, but this pattern has not been fully exploited. FlexDiT matches token density to each level with a three-segment architecture: Poolingformer in the bottom layers, Sparse-Dense Token Modules (SDTM) in the middle layers, and dense tokens in the top layers.
4. **Token density management in the spatio-temporal dimension**: FlexDiT dynamically adjusts token density in the temporal and spatial dimensions. In early denoising stages, the model uses fewer tokens to capture low-frequency global structure; as denoising progresses, the number of tokens is gradually increased to capture high-frequency local details.

With these changes, FlexDiT reduces FLOPs (floating-point operations) and increases inference speed on multiple datasets while keeping the FID (Fréchet Inception Distance) score low. The reported results show a 55% FLOPs reduction on DiT-XL at the cost of a 0.09 FID increase on 512$\times$512 ImageNet, and a 69% inference speedup on PixArt-$\alpha$ together with a 0.24 FID decrease, so generation quality is maintained or in some cases improved.

In summary, the paper aims to substantially improve the computational efficiency of DiT while preserving generation quality through a token density management strategy, thereby widening the range of settings in which DiT is practical.
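To make the relationship between token count and attention cost concrete, here is a minimal Python sketch. It is not FlexDiT's implementation: the linear ramp in `tokens_at_step`, the `min_ratio` parameter, and the example values (1024 tokens, width 1152, 50 steps) are assumptions chosen for illustration, and the cost model counts only the two attention matrix products.

```python
def attention_macs(num_tokens: int, dim: int) -> int:
    """Approximate multiply-accumulates for one self-attention layer.

    Counts only the two N x N matrix products (QK^T and the
    attention-weighted sum over V); projections and softmax are ignored.
    """
    return 2 * num_tokens * num_tokens * dim


def tokens_at_step(step: int, total_steps: int, full_tokens: int,
                   min_ratio: float = 0.25) -> int:
    """Hypothetical schedule: few tokens early in denoising, all tokens at the end.

    `step` counts from 0 (noisiest) to total_steps - 1 (final).
    The linear ramp and `min_ratio` are illustrative assumptions,
    not the schedule used in the paper.
    """
    progress = 1.0 if total_steps <= 1 else step / (total_steps - 1)
    ratio = min_ratio + (1.0 - min_ratio) * progress
    return max(1, round(full_tokens * ratio))


def relative_attention_cost(total_steps: int, full_tokens: int, dim: int,
                            min_ratio: float = 0.25) -> float:
    """Attention cost of the scheduled run relative to using all tokens at every step."""
    dense = total_steps * attention_macs(full_tokens, dim)
    scheduled = sum(
        attention_macs(tokens_at_step(s, total_steps, full_tokens, min_ratio), dim)
        for s in range(total_steps)
    )
    return scheduled / dense


if __name__ == "__main__":
    # Illustrative values only.
    print(relative_attention_cost(total_steps=50, full_tokens=1024, dim=1152))
```

Because the cost is quadratic in the token count, halving the tokens at a step cuts that step's attention cost to a quarter, which is why reducing token density in early denoising stages can save a disproportionate share of the compute.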

FlexDiT: Dynamic Token Density Control for Diffusion Transformer

Dynamic Diffusion Transformer

DiTFastAttn: Attention Compression for Diffusion Transformer Models

Efficient Diffusion Transformer with Step-wise Dynamic Attention Mediators

Accelerating Diffusion Transformers with Token-wise Feature Caching

SD-DiT: Unleashing the Power of Self-supervised Discrimination in Diffusion Transformer

Qihoo-T2X: An Efficient Proxy-Tokenized Diffusion Transformer for Text-to-Any-Task

Accelerating Vision Diffusion Transformers with Skip Branches

U-DiTs: Downsample Tokens in U-Shaped Diffusion Transformers

$\Delta$-DiT: A Training-Free Acceleration Method Tailored for Diffusion Transformers

DiffiT: Diffusion Vision Transformers for Image Generation

EDT: An Efficient Diffusion Transformer Framework Inspired by Human-like Sketching

Token Caching for Diffusion Transformer Acceleration

FasterDiT: Towards Faster Diffusion Transformers Training without Architecture Modification

SmoothCache: A Universal Inference Acceleration Technique for Diffusion Transformers

DiT4Edit: Diffusion Transformer for Image Editing

DiG: Scalable and Efficient Diffusion Models with Gated Linear Attention

MDTv2: Masked Diffusion Transformer is a Strong Image Synthesizer

DiT: Efficient Vision Transformers with Dynamic Token Routing

LinFusion: 1 GPU, 1 Minute, 16K Image

AV-DiT: Efficient Audio-Visual Diffusion Transformer for Joint Audio and Video Generation