Abstract: Modern diffusion models, particularly those utilizing a Transformer-based UNet for denoising, rely heavily on self-attention operations to manage complex spatial relationships, thus achieving impressive generation performance. However, this paradigm faces significant challenges in generating high-resolution visual content because its time and memory complexity is quadratic in the number of spatial tokens. To address this limitation, we explore a novel linear attention mechanism as an alternative. Specifically, we begin our exploration from recently introduced models with linear complexity, e.g., Mamba2, RWKV6, and Gated Linear Attention, and identify two key features, attention normalization and non-causal inference, that enhance high-resolution visual generation performance. Building on these insights, we introduce a generalized linear attention paradigm, which serves as a low-rank approximation of a wide spectrum of popular linear token mixers. To reduce training cost and better leverage pre-trained models, we initialize our models from pre-trained StableDiffusion (SD) and distill its knowledge. We find that the distilled model, termed LinFusion, achieves performance on par with or superior to the original SD after only modest training, while significantly reducing time and memory complexity. Extensive experiments on SD-v1.5, SD-v2.1, and SD-XL demonstrate that LinFusion enables satisfactory and efficient zero-shot cross-resolution generation, accommodating ultra-resolution images such as 16K on a single GPU. Moreover, it is highly compatible with pre-trained SD components and pipelines, such as ControlNet, IP-Adapter, DemoFusion, and DistriFusion, requiring no adaptation effort. Code is available at https://github.com/Huage001/LinFusion.
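
To make the complexity argument concrete, the following is a minimal NumPy sketch of a generic non-causal, normalized linear attention, checked against its explicit quadratic form. It is not LinFusion's actual module: the function names, the `elu(x) + 1` feature map, and the `eps` stabilizer are assumptions chosen for illustration. The point it shows is that summarizing all keys and values once lets each query be answered without building the n x n attention matrix.

```python
import numpy as np

def feature_map(x):
    # Positive feature map elu(x) + 1; an assumed, commonly used choice,
    # not necessarily the one used in the paper.
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))

def linear_attention(q, k, v, eps=1e-6):
    """Non-causal, normalized linear attention (illustrative sketch).

    q, k: (n, d) arrays; v: (n, d_v) array.
    Cost scales linearly in the number of tokens n.
    """
    phi_q, phi_k = feature_map(q), feature_map(k)
    kv = phi_k.T @ v              # (d, d_v): one summary over ALL tokens (non-causal)
    k_sum = phi_k.sum(axis=0)     # (d,): used for normalization
    numer = phi_q @ kv            # (n, d_v)
    denom = phi_q @ k_sum + eps   # (n,): per-query normalizer
    return numer / denom[:, None]

def quadratic_reference(q, k, v, eps=1e-6):
    # Same result computed through the explicit n x n weight matrix.
    phi_q, phi_k = feature_map(q), feature_map(k)
    weights = phi_q @ phi_k.T
    weights = weights / (weights.sum(axis=1, keepdims=True) + eps)
    return weights @ v

rng = np.random.default_rng(0)
q = rng.standard_normal((16, 4))
k = rng.standard_normal((16, 4))
v = rng.standard_normal((16, 3))
out = linear_attention(q, k, v)
```

Both functions compute the same output up to floating-point error; only the order of the matrix products differs, which is what removes the quadratic dependence on the token count.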

Flexiffusion: Segment-wise Neural Architecture Search for Flexible Denoising Schedule

Lightweight Diffusion Models with Distillation-Based Block Neural Architecture Search

AutoDiffusion: Training-Free Optimization of Time Steps and Architectures for Automated Diffusion Model Acceleration

Faster Diffusion: Rethinking the Role of the Encoder for Diffusion Model Inference

FRDiff : Feature Reuse for Universal Training-free Acceleration of Diffusion Models

AdaDiff: Adaptive Step Selection for Fast Diffusion

DiffNAS: Bootstrapping Diffusion Models by Prompting for Better Architectures

Improving Efficiency of Diffusion Models via Multi-Stage Framework and Tailored Multi-Decoder Architectures

AsyncDiff: Parallelizing Diffusion Models by Asynchronous Denoising

Not All Steps are Equal: Efficient Generation with Progressive Diffusion Models

Factorized Diffusion Architectures for Unsupervised Image Generation and Segmentation

LinFusion: 1 GPU, 1 Minute, 16K Image

DistriFusion: Distributed Parallel Inference for High-Resolution Diffusion Models

Denoising Diffusion Step-aware Models

Optimal Linear Subspace Search: Learning to Construct Fast and High-Quality Schedulers for Diffusion Models

SNED: Superposition Network Architecture Search for Efficient Video Diffusion Model

BudgetFusion: Perceptually-Guided Adaptive Diffusion Models

Training-Free Adaptive Diffusion with Bounded Difference Approximation Strategy

Efficient Diffusion Training Via Min-SNR Weighting Strategy

Partially Conditioned Patch Parallelism for Accelerated Diffusion Model Inference