Abstract: While image generation with diffusion models has achieved great success, generating images of higher resolution than the training size remains a challenging task due to the high computational cost. Current methods typically perform the entire sampling process at full resolution and process all frequency components simultaneously, which contradicts the inherent coarse-to-fine nature of latent diffusion models and wastes computation on premature high-frequency details at early diffusion stages. To address this issue, we introduce an efficient $\textbf{Fre}$quency-aware $\textbf{Ca}$scaded $\textbf{S}$ampling framework, $\textbf{FreCaS}$ for short, for higher-resolution image generation. FreCaS decomposes the sampling process into cascaded stages with gradually increasing resolutions, progressively expanding the frequency bands and refining the corresponding details. We propose a frequency-aware classifier-free guidance (FA-CFG) strategy that assigns different guidance strengths to different frequency components, directing the diffusion model to add new details in the expanded frequency domain of each stage. Additionally, we fuse the cross-attention maps of the previous and current stages to avoid synthesizing unfaithful layouts. Experiments demonstrate that FreCaS significantly outperforms state-of-the-art methods in image quality and generation speed. In particular, when generating a 2048$\times$2048 image with a pre-trained SDXL model, FreCaS is about 2.86$\times$ and 6.07$\times$ faster than ScaleCrafter and DemoFusion and achieves FID$_b$ improvements of 11.6 and 3.7, respectively. FreCaS can be easily extended to more complex models such as SD3.
The source code of FreCaS can be found at https://github.com/xtudbxk/FreCaS.
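
To make the FA-CFG idea concrete, the following is a minimal sketch of guidance that weights low- and high-frequency components differently. It is not the authors' implementation: the abstract does not specify how frequencies are separated or how the weights are chosen, so the FFT box low-pass filter, the linear recombination, and the names `split_frequencies`, `frequency_aware_cfg`, `w_low`, `w_high`, and `cutoff` are all assumptions made for illustration.

```python
import numpy as np


def split_frequencies(x, cutoff):
    """Split a real 2-D array into low- and high-frequency parts.

    Assumption: a box low-pass mask in the FFT domain that keeps normalized
    frequencies with |f| <= cutoff / 2 on each axis (cutoff in (0, 1]).
    """
    h, w = x.shape
    fy = np.fft.fftfreq(h)[:, None]  # normalized frequencies in [-0.5, 0.5)
    fx = np.fft.fftfreq(w)[None, :]
    mask = (np.abs(fy) <= cutoff / 2) & (np.abs(fx) <= cutoff / 2)
    low = np.fft.ifft2(np.fft.fft2(x) * mask).real
    high = x - low  # the two parts always sum back to x
    return low, high


def frequency_aware_cfg(eps_uncond, eps_cond, w_low, w_high, cutoff=0.5):
    """Hypothetical frequency-aware variant of classifier-free guidance.

    Standard CFG computes eps_uncond + w * (eps_cond - eps_uncond). Here the
    guidance direction is split into two bands, each with its own strength.
    """
    delta = eps_cond - eps_uncond
    low, high = split_frequencies(delta, cutoff)
    return eps_uncond + w_low * low + w_high * high


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    eps_u = rng.standard_normal((64, 64))  # stand-in for an unconditional noise prediction
    eps_c = rng.standard_normal((64, 64))  # stand-in for a conditional noise prediction
    guided = frequency_aware_cfg(eps_u, eps_c, w_low=5.0, w_high=10.0)
    print(guided.shape)
```

With `w_low == w_high` the sketch reduces to ordinary classifier-free guidance, which is a convenient sanity check. In a cascaded setting, one would presumably set `cutoff` to the ratio between the previous and current stage resolutions so that the stronger weight applies to the newly added band, but that choice is also an assumption here.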

Make a Cheap Scaling: A Self-Cascade Diffusion Model for Higher-Resolution Adaptation

Greedy Growing Enables High-Resolution Pixel-Based Diffusion Models

Upsample Guidance: Scale Up Diffusion Models without Training

Cascaded Diffusion Models for High Fidelity Image Generation

DiffuseHigh: Training-free Progressive High-Resolution Image Synthesis through Structure Guidance

HiDiffusion: Unlocking Higher-Resolution Creativity and Efficiency in Pretrained Diffusion Models

Towards Precise Scaling Laws for Video Diffusion Transformers

FreCaS: Efficient Higher-Resolution Image Generation via Frequency-aware Cascaded Sampling

One-step Generative Diffusion for Realistic Extreme Image Rescaling

Boosting Latent Diffusion with Flow Matching

FouriScale: A Frequency Perspective on Training-Free High-Resolution Image Synthesis

FineDiffusion: Scaling up Diffusion Models for Fine-grained Image Generation with 10,000 Classes

MegaFusion: Extend Diffusion Models towards Higher-resolution Image Generation without Further Tuning

Training-free Diffusion Model Adaptation for Variable-Sized Text-to-Image Synthesis

High-Resolution Image Editing via Multi-Stage Blended Diffusion

ZeroSmooth: Training-free Diffuser Adaptation for High Frame Rate Video Generation

VideoCrafter2: Overcoming Data Limitations for High-Quality Video Diffusion Models

ACDMSR: Accelerated Conditional Diffusion Models for Single Image Super-Resolution

Upscale-A-Video: Temporal-Consistent Diffusion Model for Real-World Video Super-Resolution

Diffscaler: Enhancing the Generative Prowess of Diffusion Transformers