Abstract: Modern diffusion models, particularly those utilizing a Transformer-based UNet for denoising, rely heavily on self-attention operations to manage complex spatial relationships, thus achieving impressive generation performance. However, this paradigm faces significant challenges in generating high-resolution visual content because its time and memory complexity is quadratic in the number of spatial tokens. To address this limitation, this paper pursues a novel linear attention mechanism as an alternative. Specifically, we begin our exploration from recently introduced models with linear complexity, e.g., Mamba2, RWKV6, and Gated Linear Attention, and identify two key features, attention normalization and non-causal inference, that enhance high-resolution visual generation performance. Building on these insights, we introduce a generalized linear attention paradigm, which serves as a low-rank approximation of a wide spectrum of popular linear token mixers. To reduce training cost and better leverage pre-trained models, we initialize our models from pre-trained StableDiffusion (SD) and distill its knowledge. We find that the distilled model, termed LinFusion, achieves performance on par with or superior to the original SD after only modest training, while significantly reducing time and memory complexity. Extensive experiments on SD-v1.5, SD-v2.1, and SD-XL demonstrate that LinFusion enables satisfactory and efficient zero-shot cross-resolution generation, accommodating ultra-high-resolution images such as 16K on a single GPU. Moreover, it is highly compatible with pre-trained SD components and pipelines, such as ControlNet, IP-Adapter, DemoFusion, and DistriFusion, requiring no adaptation effort. Code is available at https://github.com/Huage001/LinFusion.
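To make the complexity argument concrete, the following is a minimal sketch of generic non-causal, normalized linear attention. It is not LinFusion's actual module: the `elu(x) + 1` feature map, the function names, and the `eps` stabilizer are assumptions chosen for illustration, and plain Python lists are used only to keep the example self-contained. The linear variant computes two summaries shared by all queries (K^T V and the sum of key rows), so it never builds the n-by-n score matrix that the quadratic reference materializes.

```python
import math


def feature_map(x):
    """elu(x) + 1: a positive feature map (an assumption of this sketch)."""
    return x + 1.0 if x > 0 else math.exp(x)


def _map_rows(rows):
    return [[feature_map(x) for x in row] for row in rows]


def linear_attention(q, k, v, eps=1e-6):
    """Non-causal, normalized linear attention.

    q, k: n rows of d features; v: n rows of d_v features.
    Cost grows as O(n * d * d_v); no n-by-n matrix is formed.
    """
    q, k = _map_rows(q), _map_rows(k)
    d, d_v = len(k[0]), len(v[0])
    # Summaries over ALL tokens (non-causal): kv = K^T V, k_sum = sum of key rows.
    kv = [[sum(k_j[a] * v_j[b] for k_j, v_j in zip(k, v)) for b in range(d_v)]
          for a in range(d)]
    k_sum = [sum(k_j[a] for k_j in k) for a in range(d)]
    out = []
    for q_i in q:
        # Normalizer: makes each token's implicit attention weights sum to ~1.
        den = sum(q_i[a] * k_sum[a] for a in range(d)) + eps
        out.append([sum(q_i[a] * kv[a][b] for a in range(d)) / den
                    for b in range(d_v)])
    return out


def quadratic_attention(q, k, v, eps=1e-6):
    """Reference with the same weights, computing all n * n scores explicitly."""
    q, k = _map_rows(q), _map_rows(k)
    d_v = len(v[0])
    out = []
    for q_i in q:
        scores = [sum(x * y for x, y in zip(q_i, k_j)) for k_j in k]
        den = sum(scores) + eps
        out.append([sum(s * v_j[b] for s, v_j in zip(scores, v)) / den
                    for b in range(d_v)])
    return out
```

Both functions return the same values up to floating-point rounding; only the order of operations differs, which is where the linear scaling in the number of spatial tokens comes from.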

PipeFusion: Patch-level Pipeline Parallelism for Diffusion Transformers Inference

DistriFusion: Distributed Parallel Inference for High-Resolution Diffusion Models

DiffusionPipe: Training Large Diffusion Models with Efficient Pipelines

Partially Conditioned Patch Parallelism for Accelerated Diffusion Model Inference

LinFusion: 1 GPU, 1 Minute, 16K Image

Dynamic Diffusion Transformer

vPipe: A Virtualized Acceleration System for Achieving Efficient and Scalable Pipeline Parallel DNN Training

StreamDiffusion: A Pipeline-level Solution for Real-time Interactive Generation

Accelerating Vision Diffusion Transformers with Skip Branches

DiG: Scalable and Efficient Diffusion Models with Gated Linear Attention

FreePipe: a Programmable Parallel Rendering Architecture for Efficient Multi-Fragment Effects

$Δ$-DiT: A Training-Free Acceleration Method Tailored for Diffusion Transformers

PatchScaler: An Efficient Patch-Independent Diffusion Model for Image Super-Resolution

Patch Diffusion: Faster and More Data-Efficient Training of Diffusion Models

TinyFusion: Diffusion Transformers Learned Shallow

FLUX: Fast Software-based Communication Overlap On GPUs Through Kernel Fusion

High-Resolution Frame Interpolation with Patch-based Cascaded Diffusion

PipeDream: Fast and Efficient Pipeline Parallel DNN Training

PipePar: Enabling fast DNN pipeline parallel training in heterogeneous GPU clusters