Abstract: Modern diffusion models, particularly those utilizing a Transformer-based UNet for denoising, rely heavily on self-attention operations to manage complex spatial relationships, thus achieving impressive generation performance. However, this existing paradigm faces significant challenges in generating high-resolution visual content due to its quadratic time and memory complexity with respect to the number of spatial tokens. To address this limitation, we propose a novel linear attention mechanism as an alternative in this paper. Specifically, we begin our exploration from recently introduced models with linear complexity, e.g., Mamba2, RWKV6, and Gated Linear Attention, and identify two key features, attention normalization and non-causal inference, that enhance high-resolution visual generation performance. Building on these insights, we introduce a generalized linear attention paradigm, which serves as a low-rank approximation of a wide spectrum of popular linear token mixers. To reduce training cost and better leverage pre-trained models, we initialize our models from pre-trained Stable Diffusion (SD) and distill its knowledge. We find that the distilled model, termed LinFusion, achieves performance on par with or superior to the original SD after only modest training, while significantly reducing time and memory complexity. Extensive experiments on SD-v1.5, SD-v2.1, and SD-XL demonstrate that LinFusion enables satisfactory and efficient zero-shot cross-resolution generation, accommodating ultra-resolution images such as 16K on a single GPU. Moreover, it is highly compatible with pre-trained SD components and pipelines, such as ControlNet, IP-Adapter, DemoFusion, and DistriFusion, requiring no adaptation efforts. Code is available at https://github.com/Huage001/LinFusion.
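To make the quadratic scaling concrete, the following back-of-the-envelope sketch counts spatial tokens and the size of a single full attention map at several image sizes. It rests on assumptions not stated in the abstract: that "16K" means a 16384×16384 image, that the latent is downsampled 8× per side, that attention runs at full latent resolution, and that each attention entry takes 2 bytes (fp16). The function name `attention_cost` is hypothetical.

```python
def attention_cost(image_side_px, downsample=8, bytes_per_elem=2):
    """Return (token_count, bytes) for one dense n x n attention map.

    Assumes a square image, a latent grid downsampled by `downsample`
    per side, and one token per latent position (illustrative only).
    """
    latent_side = image_side_px // downsample
    n = latent_side * latent_side            # number of spatial tokens
    quadratic_bytes = n * n * bytes_per_elem  # one n x n map, per head
    return n, quadratic_bytes


if __name__ == "__main__":
    for side in (512, 1024, 16384):
        n, q = attention_cost(side)
        print(f"{side}px: {n} tokens, {q / 2**30:.2f} GiB per attention map")
```

Under these assumptions, doubling the image side multiplies the token count by 4 and the attention-map size by 16, which is why a dense map becomes impractical long before 16K.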

What problem does this paper attempt to address?

### The Problem the Paper Attempts to Solve

This paper attempts to address the challenges faced by modern diffusion models in generating high-resolution visual content. Specifically:

1. **Limitations of Existing Models**:
   - Modern diffusion models, especially those using a Transformer-based UNet for denoising, rely on self-attention to handle complex spatial relationships, achieving impressive generative performance.
   - However, this paradigm struggles with high-resolution visual content because the time and memory complexity of self-attention scale quadratically with the number of spatial tokens.
2. **Proposed New Method**:
   - To overcome these limitations, the paper introduces a new linear attention mechanism as an alternative.
   - The authors start from recently proposed models with linear complexity (such as Mamba2, RWKV6, and Gated Linear Attention) and identify two key features, attention normalization and non-causal inference, that enhance high-resolution visual generation.
   - Based on these insights, the authors propose a generalized linear attention paradigm that serves as a low-rank approximation of a wide range of popular linear token mixers.
3. **Model Training and Knowledge Distillation**:
   - To save training cost and better utilize pre-trained models, the authors initialize their model from pre-trained Stable Diffusion (SD) and distill knowledge from it.
   - Experimental results show that the distilled model (referred to as LinFusion) matches or exceeds the performance of the original SD after only modest training, while significantly reducing time and memory complexity.
4. **Experimental Validation**:
   - Extensive experiments on SD-v1.5, SD-v2.1, and SD-XL demonstrate that LinFusion enables efficient zero-shot cross-resolution generation, including ultra-high-resolution images such as 16K on a single GPU.
   - Additionally, LinFusion is highly compatible with pre-trained SD components and pipelines (such as ControlNet, IP-Adapter, DemoFusion, and DistriFusion) without requiring additional adaptation.

### Summary

The main goal of the paper is to address the time and memory complexity of existing diffusion models when generating high-resolution visual content by introducing a new linear attention mechanism. With this approach, the authors match or improve on the generative performance of the original SD while significantly reducing the demand for computational resources.
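The two features named above, attention normalization and non-causal inference, can be illustrated with a generic kernelized linear attention in NumPy. This is a minimal sketch under stated assumptions, not LinFusion's actual layer: the `elu(x) + 1` feature map, the `eps` stabilizer, and all function names are choices made here for illustration. Every token attends to all tokens (non-causal), each token's weights are normalized to sum to roughly one, and the n×n attention map is never materialized.

```python
import numpy as np


def feature_map(x):
    # elu(x) + 1: a positive feature map (an assumption of this sketch).
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))


def linear_attention(q, k, v, eps=1e-6):
    """Non-causal, normalized linear attention.

    q, k: (n, d) arrays; v: (n, d_v) array. Cost is O(n * d * d_v)
    because keys and values are summarized once for all tokens.
    """
    phi_q, phi_k = feature_map(q), feature_map(k)
    kv = phi_k.T @ v           # (d, d_v): global summary over all tokens
    z = phi_k.sum(axis=0)      # (d,): normalizer summary
    num = phi_q @ kv           # (n, d_v)
    den = phi_q @ z + eps      # (n,): per-token normalization
    return num / den[:, None]


def quadratic_reference(q, k, v, eps=1e-6):
    """Same result computed via the explicit n x n map, for checking."""
    phi_q, phi_k = feature_map(q), feature_map(k)
    a = phi_q @ phi_k.T        # (n, n): what the linear form avoids
    return (a @ v) / (a.sum(axis=1, keepdims=True) + eps)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q, k, v = (rng.standard_normal((64, 8)) for _ in range(3))
    print(np.allclose(linear_attention(q, k, v), quadratic_reference(q, k, v)))
```

The two functions agree by associativity of matrix multiplication: `(phi_q @ phi_k.T) @ v` equals `phi_q @ (phi_k.T @ v)`, but the second grouping keeps the intermediate at size d×d_v regardless of the token count.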

LinFusion: 1 GPU, 1 Minute, 16K Image

DistriFusion: Distributed Parallel Inference for High-Resolution Diffusion Models

CLEAR: Conv-Like Linearization Revs Pre-Trained Diffusion Transformers Up

SnapFusion: Text-to-Image Diffusion Model on Mobile Devices within Two Seconds

UniFL: Improve Latent Diffusion Model via Unified Feedback Learning

LinGen: Towards High-Resolution Minute-Length Text-to-Video Generation with Linear Computational Complexity

Flexiffusion: Segment-wise Neural Architecture Search for Flexible Denoising Schedule

Lfdt-Fusion: A Latent Feature-Guided Diffusion Transformer Model for General Image Fusion

Fast High-Resolution Image Synthesis with Latent Adversarial Diffusion Distillation

LaMamba-Diff: Linear-Time High-Fidelity Diffusion Models Based on Local Attention and Mamba

DiG: Scalable and Efficient Diffusion Models with Gated Linear Attention

PipeFusion: Patch-level Pipeline Parallelism for Diffusion Transformers Inference

Lumina-T2X: Transforming Text into Any Modality, Resolution, and Duration via Flow-based Large Diffusion Transformers

One Diffusion to Generate Them All

Alleviating Distortion in Image Generation via Multi-Resolution Diffusion Models and Time-Dependent Layer Normalization

SinFusion: Training Diffusion Models on a Single Image or Video

MMDRFuse: Distilled Mini-Model with Dynamic Refresh for Multi-Modality Image Fusion

Fusion-Based Low-Light Image Enhancement

Diffusion Models Without Attention

FLFusionSR: a Fast and Lightweight Fusion and Super-Resolution Network for Infrared and Visible Images on Edge Devices

Lumina-Next: Making Lumina-T2X Stronger and Faster with Next-DiT