LinFusion: 1 GPU, 1 Minute, 16K Image

Songhua Liu, Weihao Yu, Zhenxiong Tan, Xinchao Wang
2024-10-17
Abstract: Modern diffusion models, particularly those utilizing a Transformer-based UNet for denoising, rely heavily on self-attention operations to manage complex spatial relationships, thus achieving impressive generation performance. However, this existing paradigm faces significant challenges in generating high-resolution visual content due to its quadratic time and memory complexity with respect to the number of spatial tokens. To address this limitation, we aim at a novel linear attention mechanism as an alternative in this paper. Specifically, we begin our exploration from recently introduced models with linear complexity, e.g., Mamba2, RWKV6, Gated Linear Attention, etc., and identify two key features--attention normalization and non-causal inference--that enhance high-resolution visual generation performance. Building on these insights, we introduce a generalized linear attention paradigm, which serves as a low-rank approximation of a wide spectrum of popular linear token mixers. To save the training cost and better leverage pre-trained models, we initialize our models and distill the knowledge from pre-trained StableDiffusion (SD). We find that the distilled model, termed LinFusion, achieves performance on par with or superior to the original SD after only modest training, while significantly reducing time and memory complexity. Extensive experiments on SD-v1.5, SD-v2.1, and SD-XL demonstrate that LinFusion enables satisfactory and efficient zero-shot cross-resolution generation, accommodating ultra-resolution images like 16K on a single GPU. Moreover, it is highly compatible with pre-trained SD components and pipelines, such as ControlNet, IP-Adapter, DemoFusion, DistriFusion, etc., requiring no adaptation efforts. Code is available at https://github.com/Huage001/LinFusion.
Computer Vision and Pattern Recognition, Machine Learning
What problem does this paper attempt to address?
### The Problem the Paper Attempts to Solve

This paper addresses the difficulty modern diffusion models have in generating high-resolution visual content. Specifically:

1. **Limitations of Existing Models**:
   - Modern diffusion models, especially those using a Transformer-based UNet for denoising, rely on self-attention to handle complex spatial relationships, and this gives them impressive generative performance.
   - However, the time and memory cost of self-attention scales quadratically with the number of spatial tokens, which makes high-resolution generation expensive.
2. **Proposed New Method**:
   - To overcome this limitation, the paper introduces a linear attention mechanism as an alternative to self-attention.
   - Starting from recently proposed models with linear complexity (such as Mamba2, RWKV6, and Gated Linear Attention), the authors identify two key features, attention normalization and non-causal inference, that improve high-resolution visual generation.
   - Building on these insights, they propose a generalized linear attention paradigm that serves as a low-rank approximation of a wide range of popular linear token mixers.
3. **Model Training and Knowledge Distillation**:
   - To reduce training cost and make better use of pre-trained models, the authors initialize their model from pre-trained StableDiffusion (SD) and distill its knowledge.
   - The distilled model, termed LinFusion, matches or exceeds the performance of the original SD after only modest training, while significantly reducing time and memory complexity.
4. **Experimental Validation**:
   - Extensive experiments on SD-v1.5, SD-v2.1, and SD-XL show that LinFusion supports efficient zero-shot cross-resolution generation, including ultra-resolution images such as 16K on a single GPU.
   - LinFusion is also highly compatible with pre-trained SD components and pipelines (such as ControlNet, IP-Adapter, DemoFusion, and DistriFusion) and requires no additional adaptation.

### Summary

The main goal of the paper is to remove the quadratic time and memory cost that limits existing diffusion models in high-resolution generation by replacing self-attention with a linear attention mechanism. With this change, LinFusion keeps generation quality on par with or better than the original SD while significantly reducing the demand for computational resources.
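To make the complexity argument concrete, the following is a minimal NumPy sketch that contrasts standard softmax attention with a generic normalized, non-causal linear attention. It is not LinFusion's actual module: the function names, the `elu(x) + 1` feature map, and the `eps` constant are illustrative assumptions, chosen only to show why the cost becomes linear in the number of tokens `n`.

```python
import numpy as np


def softmax_attention(q, k, v):
    """Standard attention. q, k: (n, d); v: (n, d_v).

    Materializes an (n, n) score matrix, so time and memory grow as O(n^2).
    """
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v


def feature_map(x):
    """Positive feature map elu(x) + 1 (an assumed choice for this sketch)."""
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))


def linear_attention(q, k, v, eps=1e-6):
    """Normalized, non-causal linear attention. Same shapes as above.

    Keys and values are first summarized into a (d, d_v) matrix, so no
    (n, n) matrix is ever built and the cost is O(n * d * d_v).
    """
    phi_q, phi_k = feature_map(q), feature_map(k)
    kv = phi_k.T @ v             # (d, d_v): sum over ALL tokens (non-causal)
    k_sum = phi_k.sum(axis=0)    # (d,): used for the normalization term
    numer = phi_q @ kv           # (n, d_v)
    denom = phi_q @ k_sum + eps  # (n,): each token's attention weights sum to ~1
    return numer / denom[:, None]
```

Because the key-value summary `kv` has a fixed size that does not depend on `n`, doubling the image side length (four times as many tokens) roughly quadruples the cost of `linear_attention`, whereas the score matrix in `softmax_attention` grows sixteenfold. The division by `denom` is the normalization step, and summing over all tokens rather than only preceding ones is what makes the operation non-causal.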