DiG: Scalable and Efficient Diffusion Models with Gated Linear Attention

Lianghui Zhu, Zilong Huang, Bencheng Liao, Jun Hao Liew, Hanshu Yan, Jiashi Feng, Xinggang Wang

2024-05-29

Abstract: Diffusion models with large-scale pre-training have achieved significant success in the field of visual content generation, particularly exemplified by Diffusion Transformers (DiT). However, DiT models have faced challenges with scalability and quadratic complexity efficiency. In this paper, we aim to leverage the long sequence modeling capability of Gated Linear Attention (GLA) Transformers, expanding its applicability to diffusion models. We introduce Diffusion Gated Linear Attention Transformers (DiG), a simple, adoptable solution with minimal parameter overhead, following the DiT design, but offering superior efficiency and effectiveness. In addition to better performance than DiT, DiG-S/2 exhibits $2.5\times$ higher training speed than DiT-S/2 and saves $75.7\%$ GPU memory at a resolution of $1792 \times 1792$. Moreover, we analyze the scalability of DiG across a variety of computational complexity. DiG models, with increased depth/width or augmentation of input tokens, consistently exhibit decreasing FID. We further compare DiG with other subquadratic-time diffusion models. With the same model size, DiG-XL/2 is $4.2\times$ faster than the recent Mamba-based diffusion model at a $1024$ resolution, and is $1.8\times$ faster than DiT with CUDA-optimized FlashAttention-2 under the $2048$ resolution. All these results demonstrate its superior efficiency among the latest diffusion models. Code is released at https://github.com/hustvl/DiG.
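To make the quadratic-versus-linear concern in the abstract concrete, the following rough sketch counts tokens at several image resolutions. It assumes the usual DiT setup, which the abstract does not spell out: an autoencoder that downsamples each side by 8 and a patch size of 2 (the "/2" in the model names). The function name `token_count` and both default values are illustrative assumptions, not taken from the released code.

```python
def token_count(resolution: int, vae_downsample: int = 8, patch_size: int = 2) -> int:
    """Number of tokens for a square image under the assumed DiT-style pipeline."""
    latent_side = resolution // vae_downsample   # side length of the latent feature map
    tokens_per_side = latent_side // patch_size  # patches along one side
    return tokens_per_side ** 2


if __name__ == "__main__":
    for res in (256, 512, 1024, 1792, 2048):
        n = token_count(res)
        # Softmax attention forms an n x n score matrix; a linear-attention
        # recurrence instead does work proportional to n.
        print(f"{res:>5} px: {n:>6} tokens, {n * n:>12} pairwise scores")
```

Under these assumptions, doubling the resolution multiplies the token count by 4 and the number of pairwise attention scores by 16, which is why the reported gaps are largest at high resolution.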

Computer Vision and Pattern Recognition, Artificial Intelligence

What problem does this paper attempt to address?

The paper proposes a new diffusion model architecture, Diffusion Gated Linear Attention Transformers (DiG), to address the scalability and efficiency problems that existing diffusion models face in large-scale pre-training and high-resolution image generation. Specifically, DiG replaces the attention in Diffusion Transformers (DiT) with the Gated Linear Attention (GLA) Transformer. GLA models long sequences with a linear attention mechanism, so its cost grows linearly with sequence length rather than quadratically. DiG adds a lightweight Spatial Reorientation and Enhancement Module (SREM) that controls the scanning direction from layer to layer and strengthens local perception.

The main contributions of DiG are:

1. It is presented as the first attempt to apply linear attention Transformers to diffusion models. The SREM module addresses two weaknesses of linear attention in visual generation: unidirectional scanning and the lack of local perception.

2. In high-resolution image generation, DiG trains faster and uses less GPU memory than DiT. For example, at a resolution of 1792×1792, DiG-S/2 trains 2.5 times faster than DiT-S/2 and saves 75.7% of GPU memory.

3. Experiments on ImageNet show that DiG outperforms DiT in generative performance, and that the FID score keeps decreasing as model scale increases, which indicates good scalability.

In summary, DiG is an efficient and scalable diffusion model that is particularly suited to long-sequence generation tasks such as high-resolution image generation. It improves training speed and reduces resource requirements while maintaining or improving the quality of the generated images.
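The linear-time behaviour of gated linear attention mentioned above can be illustrated with a small sketch. This is a simplified single-head recurrence in pure Python, not the authors' implementation: it omits SREM, multi-head projections, normalization, and the chunked hardware-efficient kernels a real system would use. The function names and the per-key-channel gate `alpha` (values in (0, 1)) are assumptions chosen for illustration. The recurrent form keeps a fixed-size state and touches each token once, while the reference form recomputes every pairwise score.

```python
from typing import List

Matrix = List[List[float]]


def gla_recurrent(q: Matrix, k: Matrix, v: Matrix, alpha: Matrix) -> Matrix:
    """Linear-time form: S_t = diag(alpha_t) S_{t-1} + k_t v_t^T, o_t = q_t^T S_t."""
    d_k, d_v = len(k[0]), len(v[0])
    state = [[0.0] * d_v for _ in range(d_k)]  # fixed-size d_k x d_v memory
    outputs = []
    for q_t, k_t, v_t, a_t in zip(q, k, v, alpha):
        for i in range(d_k):
            for j in range(d_v):
                # Decay the old memory row-wise, then write the new key/value pair.
                state[i][j] = a_t[i] * state[i][j] + k_t[i] * v_t[j]
        outputs.append([sum(q_t[i] * state[i][j] for i in range(d_k)) for j in range(d_v)])
    return outputs


def gla_quadratic(q: Matrix, k: Matrix, v: Matrix, alpha: Matrix) -> Matrix:
    """Reference form: explicit causal pairwise scores with accumulated gate decay."""
    d_k, d_v = len(k[0]), len(v[0])
    outputs = []
    for t in range(len(q)):
        o_t = [0.0] * d_v
        for s in range(t + 1):
            # Key s is decayed by every gate applied after it, i.e. steps s+1..t.
            decay = [1.0] * d_k
            for r in range(s + 1, t + 1):
                decay = [decay[i] * alpha[r][i] for i in range(d_k)]
            score = sum(q[t][i] * decay[i] * k[s][i] for i in range(d_k))
            for j in range(d_v):
                o_t[j] += score * v[s][j]
        outputs.append(o_t)
    return outputs


# Tiny made-up example: 3 tokens, key dimension 2, value dimension 3.
q = [[1.0, 0.5], [0.2, -1.0], [0.3, 0.7]]
k = [[0.5, 1.0], [-0.4, 0.3], [1.0, 0.1]]
v = [[1.0, 2.0, 0.0], [0.5, -1.0, 1.0], [0.0, 1.0, 1.0]]
alpha = [[0.9, 0.8], [0.7, 0.95], [0.85, 0.6]]

fast = gla_recurrent(q, k, v, alpha)
slow = gla_quadratic(q, k, v, alpha)
assert all(abs(a - b) < 1e-9 for row_f, row_s in zip(fast, slow) for a, b in zip(row_f, row_s))
```

Both functions produce the same outputs, but the recurrence scans tokens in one direction only. That one-directional scan is the limitation the summary says SREM is meant to offset by changing the scanning direction between layers.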

LaMamba-Diff: Linear-Time High-Fidelity Diffusion Models Based on Local Attention and Mamba

Scalable Diffusion Models with Transformers

Dynamic Diffusion Transformer

DiTFastAttn: Attention Compression for Diffusion Transformer Models

TerDiT: Ternary Diffusion Models with Transformers

Accelerating Vision Diffusion Transformers with Skip Branches

$Δ$-DiT: A Training-Free Acceleration Method Tailored for Diffusion Transformers

FasterDiT: Towards Faster Diffusion Transformers Training without Architecture Modification

DiM: Diffusion Mamba for Efficient High-Resolution Image Synthesis

DistriFusion: Distributed Parallel Inference for High-Resolution Diffusion Models

Scalable Diffusion Models with State Space Backbone

Improving Efficiency of Diffusion Models via Multi-Stage Framework and Tailored Multi-Decoder Architectures

Scaling Diffusion Mamba with Bidirectional SSMs for Efficient Image and Video Generation

DiffiT: Diffusion Vision Transformers for Image Generation

LaVin-DiT: Large Vision Diffusion Transformer

DiP-GO: A Diffusion Pruner via Few-step Gradient Optimization

SD-DiT: Unleashing the Power of Self-supervised Discrimination in Diffusion Transformer

Diffscaler: Enhancing the Generative Prowess of Diffusion Transformers

Attention-Driven Training-Free Efficiency Enhancement of Diffusion Models