DiG: Scalable and Efficient Diffusion Models with Gated Linear Attention

Lianghui Zhu, Zilong Huang, Bencheng Liao, Jun Hao Liew, Hanshu Yan, Jiashi Feng, Xinggang Wang
2024-05-29
Abstract: Diffusion models with large-scale pre-training have achieved significant success in the field of visual content generation, particularly exemplified by Diffusion Transformers (DiT). However, DiT models have faced challenges with scalability and quadratic complexity efficiency. In this paper, we aim to leverage the long sequence modeling capability of Gated Linear Attention (GLA) Transformers, expanding its applicability to diffusion models. We introduce Diffusion Gated Linear Attention Transformers (DiG), a simple, adoptable solution with minimal parameter overhead, following the DiT design, but offering superior efficiency and effectiveness. In addition to better performance than DiT, DiG-S/2 exhibits $2.5\times$ higher training speed than DiT-S/2 and saves $75.7\%$ GPU memory at a resolution of $1792 \times 1792$. Moreover, we analyze the scalability of DiG across a variety of computational complexity. DiG models, with increased depth/width or augmentation of input tokens, consistently exhibit decreasing FID. We further compare DiG with other subquadratic-time diffusion models. With the same model size, DiG-XL/2 is $4.2\times$ faster than the recent Mamba-based diffusion model at a $1024$ resolution, and is $1.8\times$ faster than DiT with CUDA-optimized FlashAttention-2 under the $2048$ resolution. All these results demonstrate its superior efficiency among the latest diffusion models. Code is released at https://github.com/hustvl/DiG.
Computer Vision and Pattern Recognition, Artificial Intelligence
What problem does this paper attempt to address?
This paper proposes a new diffusion model architecture, Diffusion Gated Linear Attention Transformers (DiG), to address the scalability and efficiency problems that existing diffusion models face in large-scale pre-training and high-resolution image generation. Specifically, DiG uses the Gated Linear Attention (GLA) Transformer to improve on models such as Diffusion Transformers (DiT). The GLA Transformer models long sequences with a linear attention mechanism, whose cost grows linearly rather than quadratically with the number of tokens. DiG adds a lightweight Spatial Reorientation and Enhancement Module (SREM) that controls the scanning direction from layer to layer and enhances local perception.

The main contributions of DiG are:
1. It is a first attempt to apply linear attention Transformers to diffusion models. The SREM module addresses two weaknesses of linear attention in visual generation: unidirectional scanning and the lack of local perception.
2. In high-resolution image generation, DiG trains faster and uses less GPU memory than DiT. For example, at a resolution of 1792×1792, DiG-S/2 trains 2.5 times faster than DiT-S/2 and saves 75.7% of GPU memory.
3. Experiments on the ImageNet dataset show that DiG outperforms DiT in generative performance, and that the FID score keeps decreasing as the model scale increases, which demonstrates good scalability.

In summary, DiG is an efficient and scalable diffusion model that is particularly suited to long-sequence generation tasks such as high-resolution image generation. It improves training speed and reduces resource requirements while maintaining or improving the quality of the generated images.
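To make the linear-attention idea concrete, here is a minimal single-head sketch of a gated linear attention recurrence in pure Python. It is not DiG's implementation: the gate formulation (a per-key-dimension decay in (0, 1) applied to a running state) is an assumption based on the common form of gated linear attention, the function names `gated_linear_attention` and `quadratic_reference` are hypothetical, and SREM is not modeled. The sketch only illustrates why the cost scales linearly with the number of tokens: each step updates a fixed-size state instead of attending to every earlier token.

```python
def gated_linear_attention(q, k, v, g):
    """Recurrent form, O(T * d_k * d_v) time and O(d_k * d_v) state.

    q, k, g: lists of T vectors of length d_k; v: list of T vectors of length d_v.
    g[t][i] is the assumed decay gate for key dimension i at step t.
    """
    d_k, d_v = len(k[0]), len(v[0])
    state = [[0.0] * d_v for _ in range(d_k)]  # fixed size, independent of T
    outputs = []
    for q_t, k_t, v_t, g_t in zip(q, k, v, g):
        for i in range(d_k):
            for j in range(d_v):
                # Decay the old state, then add the outer product k_t^T v_t.
                state[i][j] = g_t[i] * state[i][j] + k_t[i] * v_t[j]
        # Read out: o_t = q_t @ state
        outputs.append(
            [sum(q_t[i] * state[i][j] for i in range(d_k)) for j in range(d_v)]
        )
    return outputs


def quadratic_reference(q, k, v, g):
    """Same result computed pairwise over all (t, s <= t): O(T^2) pairs."""
    num_steps, d_k, d_v = len(q), len(k[0]), len(v[0])
    outputs = []
    for t in range(num_steps):
        o_t = [0.0] * d_v
        for s in range(t + 1):
            weight = 0.0
            for i in range(d_k):
                decay = 1.0
                for r in range(s + 1, t + 1):  # gates applied after step s
                    decay *= g[r][i]
                weight += q[t][i] * decay * k[s][i]
            for j in range(d_v):
                o_t[j] += weight * v[s][j]
        outputs.append(o_t)
    return outputs


if __name__ == "__main__":
    q = [[1.0], [1.0]]
    k = [[1.0], [2.0]]
    v = [[3.0], [4.0]]
    g = [[0.5], [0.5]]
    print(gated_linear_attention(q, k, v, g))  # [[3.0], [9.5]]
```

The recurrent and pairwise forms give the same outputs, but only the pairwise form does work proportional to the square of the sequence length. Practical implementations typically use chunked, hardware-aware kernels rather than a token-by-token loop like this one.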