Abstract: Diffusion Transformers (DiT) excel at image and video generation but face computational challenges due to the quadratic complexity of self-attention operators. We propose DiTFastAttn, a post-training compression method to alleviate the computational bottleneck of DiT. We identify three key redundancies in the attention computation during DiT inference: (1) spatial redundancy, where many attention heads focus on local information; (2) temporal redundancy, with high similarity between the attention outputs of neighboring steps; (3) conditional redundancy, where conditional and unconditional inferences exhibit significant similarity. We propose three techniques to reduce these redundancies: (1) Window Attention with Residual Sharing to reduce spatial redundancy; (2) Attention Sharing across Timesteps to exploit the similarity between steps; (3) Attention Sharing across CFG to skip redundant computations during conditional generation. We apply DiTFastAttn to DiT, PixArt-Sigma for image generation tasks, and OpenSora for video generation tasks. Our results show that for image generation, our method reduces up to 76% of the attention FLOPs and achieves up to 1.8x end-to-end speedup at high-resolution (2k x 2k) generation.
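To make the quadratic-complexity claim concrete, the following is a rough back-of-the-envelope sketch rather than anything taken from the paper. It assumes a latent-space DiT with an 8x VAE downsample and a patch size of 2, and it counts only the two matrix products in attention (softmax cost is ignored). The helper names `num_tokens` and `attention_flops`, the head dimension, the head count, and the window size are all illustrative assumptions.

```python
def num_tokens(image_size, vae_downsample=8, patch_size=2):
    """Token count for a square image.

    vae_downsample and patch_size are assumed values for illustration,
    not settings confirmed by the source.
    """
    side = image_size // (vae_downsample * patch_size)
    return side * side


def attention_flops(tokens, head_dim=64, heads=16, window=None):
    """Approximate FLOPs of QK^T and AV for one attention layer.

    With window=None every query attends to every key (quadratic in tokens).
    With a window, each query attends to at most `window` keys (linear in tokens).
    """
    keys = tokens if window is None else min(window, tokens)
    # QK^T and AV each cost about 2 * tokens * keys * head_dim FLOPs per head.
    return 2 * 2 * tokens * keys * head_dim * heads


for size in (512, 1024, 2048):
    n = num_tokens(size)
    full = attention_flops(n)
    local = attention_flops(n, window=1024)
    print(f"{size}px: {n} tokens, full={full:.2e} FLOPs, window/full={local / full:.3f}")
```

Under these assumptions, doubling the image side length quadruples the token count and multiplies full-attention FLOPs by sixteen, while a fixed window keeps the cost linear in the number of tokens.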

What problem does this paper attempt to address?

### The Problem the Paper Attempts to Solve

The paper aims to address the computational bottleneck of Diffusion Transformers (DiT) in image and video generation tasks. Specifically, while DiT models excel in generating high-quality images and videos, they face significant computational challenges when handling high-resolution content due to the quadratic complexity (O(L^2), where L is the length of the input tokens) of the self-attention mechanism. This not only increases computational costs but also limits the model's inference speed.

To alleviate this issue, the authors propose **DiTFastAttn**, a post-training compression method that improves the computational efficiency of DiT models by reducing redundancy in self-attention calculations. The paper identifies three main types of redundancy:

1. **Spatial Redundancy**: Many attention heads primarily focus on local information, with attention values for distant tokens being close to zero.
2. **Temporal Redundancy**: Attention outputs between adjacent steps are highly similar.
3. **Conditional Redundancy**: Attention outputs for conditional and unconditional generation exhibit significant similarity in certain heads and steps.

To address these redundancies, the authors propose the following three techniques:

1. **Window Attention with Residual Sharing (WA-RS)**: Reduces spatial redundancy by using fixed-size window attention in certain layers and maintains performance by caching and reusing residuals.
2. **Attention Sharing across Timesteps (AST)**: Accelerates attention computation by leveraging the similarity between adjacent steps.
3. **Attention Sharing across CFG (ASC)**: Reduces redundant computation by sharing attention outputs between conditional and unconditional generation.

Experimental results show that DiTFastAttn can reduce the FLOPs of attention computation by up to 76% in image generation tasks and achieve up to 1.8x end-to-end acceleration in high-resolution (2k × 2k) generation. Additionally, the method demonstrates significant reductions in computational cost and acceleration in video generation tasks.
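The three techniques can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. It assumes a single attention head over a 1-D token sequence, and it assumes that the cached residual in WA-RS is the difference between the full and window attention outputs at a reference step. The names `attention`, `CachedAttention`, and `asc` are hypothetical, and the window is applied as a dense mask for readability, so this version shows the logic but does not itself save any FLOPs.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(q, k, v, window=None):
    """Single-head attention; with `window`, token i sees only |i - j| <= window // 2."""
    n, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    if window is not None:
        idx = np.arange(n)
        far = np.abs(idx[:, None] - idx[None, :]) > window // 2
        scores = np.where(far, -np.inf, scores)  # dense mask, for clarity only
    return softmax(scores) @ v


class CachedAttention:
    """Hypothetical per-layer wrapper holding the caches the techniques rely on."""

    def __init__(self, window):
        self.window = window
        self.residual = None  # WA-RS cache (assumed: full output minus window output)
        self.last_out = None  # AST cache: attention output of the previous step

    def full(self, q, k, v):
        """Reference step: compute full attention and refresh both caches."""
        out = attention(q, k, v)
        self.residual = out - attention(q, k, v, self.window)
        self.last_out = out
        return out

    def wa_rs(self, q, k, v):
        """Later step: window attention plus the cached residual."""
        out = attention(q, k, v, self.window) + self.residual
        self.last_out = out
        return out

    def ast(self):
        """Later step: reuse the previous step's output without recomputing."""
        return self.last_out


def asc(attn_fn, q_cond, k_cond, v_cond):
    """Compute attention once on the conditional branch and share it with the unconditional one."""
    out = attn_fn(q_cond, k_cond, v_cond)
    return out, out  # (conditional, unconditional)


rng = np.random.default_rng(0)
n, d = 64, 16
q, k, v = (rng.standard_normal((n, d)) for _ in range(3))

layer = CachedAttention(window=9)
full_out = layer.full(q, k, v)                    # reference step
q_next = q + 0.01 * rng.standard_normal((n, d))   # slightly changed input at the next step
approx = layer.wa_rs(q_next, k, v)                # WA-RS
reused = layer.ast()                              # AST
cond_out, uncond_out = asc(layer.wa_rs, q_next, k, v)  # ASC

exact = attention(q_next, k, v)
print("WA-RS max error:      ", np.abs(approx - exact).max())
print("window-only max error:", np.abs(attention(q_next, k, v, 9) - exact).max())
```

The two printed errors compare WA-RS against plain window attention on inputs that changed only slightly between steps. Which strategy to apply at each layer and step is a separate decision that this sketch leaves to the caller.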

DiTFastAttn: Attention Compression for Diffusion Transformer Models

Multi-Dimension Compression of Feed-Forward Network in Vision Transformers

FasterDiT: Towards Faster Diffusion Transformers Training without Architecture Modification

Dynamic Diffusion Transformer

$Δ$-DiT: A Training-Free Acceleration Method Tailored for Diffusion Transformers

Accelerating Vision Diffusion Transformers with Skip Branches

FlexDiT: Dynamic Token Density Control for Diffusion Transformer

DiG: Scalable and Efficient Diffusion Models with Gated Linear Attention

Efficient Diffusion Transformer with Step-wise Dynamic Attention Mediators

Qihoo-T2X: An Efficient Proxy-Tokenized Diffusion Transformer for Text-to-Any-Task

TerDiT: Ternary Diffusion Models with Transformers

FORA: Fast-Forward Caching in Diffusion Transformer Acceleration

SmoothCache: A Universal Inference Acceleration Technique for Diffusion Transformers

ViDiT-Q: Efficient and Accurate Quantization of Diffusion Transformers for Image and Video Generation

EDT: An Efficient Diffusion Transformer Framework Inspired by Human-like Sketching

Adaptive Caching for Faster Video Generation with Diffusion Transformers

DiT4Edit: Diffusion Transformer for Image Editing

Faster Diffusion via Temporal Attention Decomposition

VQ4DiT: Efficient Post-Training Vector Quantization for Diffusion Transformers

TaQ-DiT: Time-aware Quantization for Diffusion Transformers

DiffiT: Diffusion Vision Transformers for Image Generation