DiTFastAttn: Attention Compression for Diffusion Transformer Models

Zhihang Yuan, Hanling Zhang, Pu Lu, Xuefei Ning, Linfeng Zhang, Tianchen Zhao, Shengen Yan, Guohao Dai, Yu Wang
2024-10-18
Abstract: Diffusion Transformers (DiT) excel at image and video generation but face computational challenges due to the quadratic complexity of self-attention operators. We propose DiTFastAttn, a post-training compression method to alleviate the computational bottleneck of DiT. We identify three key redundancies in the attention computation during DiT inference: (1) spatial redundancy, where many attention heads focus on local information; (2) temporal redundancy, with high similarity between the attention outputs of neighboring steps; (3) conditional redundancy, where conditional and unconditional inferences exhibit significant similarity. We propose three techniques to reduce these redundancies: (1) Window Attention with Residual Sharing to reduce spatial redundancy; (2) Attention Sharing across Timesteps to exploit the similarity between steps; (3) Attention Sharing across CFG to skip redundant computations during conditional generation. We apply DiTFastAttn to DiT, PixArt-Sigma for image generation tasks, and OpenSora for video generation tasks. Our results show that for image generation, our method reduces up to 76% of the attention FLOPs and achieves up to 1.8x end-to-end speedup at high-resolution (2k x 2k) generation.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
### The Problem the Paper Attempts to Solve

The paper aims to address the computational bottleneck of Diffusion Transformers (DiT) in image and video generation tasks. Specifically, while DiT models excel in generating high-quality images and videos, they face significant computational challenges when handling high-resolution content due to the quadratic complexity (O(L^2), where L is the length of the input tokens) of the self-attention mechanism. This not only increases computational costs but also limits the model's inference speed.

To alleviate this issue, the authors propose **DiTFastAttn**, a post-training compression method that improves the computational efficiency of DiT models by reducing redundancy in self-attention calculations. The paper identifies three main types of redundancy:

1. **Spatial Redundancy**: Many attention heads primarily focus on local information, with attention values for distant tokens being close to zero.
2. **Temporal Redundancy**: Attention outputs between adjacent steps are highly similar.
3. **Conditional Redundancy**: Attention outputs for conditional and unconditional generation exhibit significant similarity in certain heads and steps.

To address these redundancies, the authors propose the following three techniques:

1. **Window Attention with Residual Sharing (WA-RS)**: Reduces spatial redundancy by using fixed-size window attention in certain layers and maintains performance by caching and reusing residuals.
2. **Attention Sharing across Timesteps (AST)**: Accelerates attention computation by leveraging the similarity between adjacent steps.
3. **Attention Sharing across CFG (ASC)**: Reduces redundant computation by sharing attention outputs between conditional and unconditional generation.

Experimental results show that DiTFastAttn can reduce the FLOPs of attention computation by up to 76% in image generation tasks and achieve up to 1.8x end-to-end acceleration in high-resolution (2k × 2k) generation.
Additionally, the method demonstrates significant reductions in computational cost and acceleration in video generation tasks.
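
The three techniques listed above can be illustrated with a small, dependency-free Python sketch. It is not the authors' implementation. It assumes that the cached "residual" in WA-RS is the difference between the full-attention output and the window-attention output at a step where both are computed, and that this residual is added back to the window-attention output at later steps. The names `attention`, `CachedAttention`, and `cfg_shared` are hypothetical, and the sketch works on a single head with a 1-D token sequence for simplicity.

```python
import math

def attention(q, k, v, window=None):
    """Single-head softmax attention over lists of row vectors.

    If `window` is given, query i only attends to keys j with
    |i - j| <= window // 2, so distant tokens are never scored.
    """
    n, d = len(q), len(q[0])
    out = []
    for i in range(n):
        if window is None:
            js = range(n)
        else:
            half = window // 2
            js = range(max(0, i - half), min(n, i + half + 1))
        scores = [sum(a * b for a, b in zip(q[i], k[j])) / math.sqrt(d) for j in js]
        top = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([sum(w * v[j][c] for w, j in zip(weights, js)) / total
                    for c in range(len(v[0]))])
    return out

class CachedAttention:
    """Hypothetical wrapper showing WA-RS and AST for one attention layer."""

    def __init__(self, window):
        self.window = window
        self.residual = None     # assumed: full output minus window output
        self.last_output = None  # most recent output, reused by AST

    def full_step(self, q, k, v):
        # Reference step: run both variants and cache their difference.
        full = attention(q, k, v)
        win = attention(q, k, v, self.window)
        self.residual = [[f - w for f, w in zip(fr, wr)] for fr, wr in zip(full, win)]
        self.last_output = full
        return full

    def window_step(self, q, k, v):
        # WA-RS: cheap window attention plus the cached long-range residual.
        win = attention(q, k, v, self.window)
        out = [[w + r for w, r in zip(wr, rr)] for wr, rr in zip(win, self.residual)]
        self.last_output = out
        return out

    def reuse_step(self):
        # AST: skip attention entirely and return the previous step's output.
        return self.last_output

def cfg_shared(attn_fn, q_cond, k_cond, v_cond):
    # ASC: compute attention for the conditional branch only and hand the
    # same output to the unconditional branch.
    out = attn_fn(q_cond, k_cond, v_cond)
    return out, out
```

In this sketch, calling `window_step` with the same inputs as the preceding `full_step` reproduces the full-attention output up to floating-point error, because the cached residual exactly cancels the window approximation. At later steps the inputs differ, so the result is only an approximation whose quality depends on how slowly the long-range contribution changes between steps. How a strategy is chosen for each layer and step is outside the scope of this sketch.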