Abstract: We address the long-standing problem of how to learn effective pixel-based image diffusion models at scale, introducing a remarkably simple greedy growing method for stable training of large-scale, high-resolution models, without the need for cascaded super-resolution components. The key insight stems from careful pre-training of core components, namely, those responsible for text-to-image alignment vs. high-resolution rendering. We first demonstrate the benefits of scaling a Shallow UNet, with no down(up)-sampling enc(dec)oder. Scaling its deep core layers is shown to improve alignment, object structure, and composition. Building on this core model, we propose a greedy algorithm that grows the architecture into high-resolution end-to-end models, while preserving the integrity of the pre-trained representation, stabilizing training, and reducing the need for large high-resolution datasets. This enables a single-stage model capable of generating high-resolution images without the need for a super-resolution cascade. Our key results rely on public datasets and show that we are able to train non-cascaded models up to 8B parameters with no further regularization schemes. Vermeer, our full pipeline model trained with internal datasets to produce 1024x1024 images, without cascades, is preferred by 44.0% vs. 21.4% of human evaluators over SDXL.

What problem does this paper attempt to address?

This paper focuses on how to train large-scale pixel-based image diffusion models that generate high-resolution images. Training such models is difficult because of unstable optimization, high resource demands, and a shortage of high-resolution training data. The paper proposes a simple greedy growing method that avoids the need for cascaded super-resolution components. First, the researchers demonstrate the benefits of scaling a Shallow UNet, which has no downsampling encoder and no upsampling decoder: scaling its deep core layers improves text-image alignment, object structure, and composition. Then, they propose a greedy algorithm that grows this core architecture into an end-to-end high-resolution model while preserving the integrity of the pre-trained representation, stabilizing training, and reducing reliance on large high-resolution datasets. The result is a single-stage model that generates high-resolution images without a super-resolution cascade. Experiments on public datasets show that the method can train non-cascaded models with up to 8 billion parameters without additional regularization schemes. Vermeer, the full pipeline model trained on internal datasets to produce 1024×1024 images, is preferred by 44.0% vs. 21.4% of human evaluators over SDXL. The paper also discusses the limitations of cascaded models, such as the distribution shift between training and inference, which can amplify artifacts produced by earlier stages of the cascade. Finally, it notes potential applications of the method in image generation and in downstream tasks, such as inverse problems, that often rely on diffusion models as image priors.
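The growing schedule described above can be pictured with a small bookkeeping sketch. It is only an illustration under stated assumptions, not the authors' implementation: the class names (`Stage`, `GrowingUNet`), the 16×16 core resolution, the factor-of-2 growth per step, and the choice to freeze already-trained stages while new ones are trained are all hypothetical. The summary only says that the pre-trained representation is preserved, not how.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Stage:
    """One block of the network (hypothetical bookkeeping object, no weights)."""
    name: str
    resolution: int          # spatial side length this stage operates at
    trainable: bool = True


@dataclass
class GrowingUNet:
    """Core model plus encoder/decoder stages added greedily around it."""
    core: Stage
    encoder: List[Stage] = field(default_factory=list)  # outermost stage first
    decoder: List[Stage] = field(default_factory=list)  # innermost stage first

    @property
    def output_resolution(self) -> int:
        # The outermost encoder stage sees the full-resolution input.
        return self.encoder[0].resolution if self.encoder else self.core.resolution

    def stages(self) -> List[Stage]:
        # Order in which data flows: encoder, then core, then decoder.
        return self.encoder + [self.core] + self.decoder

    def grow(self, factor: int = 2) -> None:
        """Wrap the current model in one new encoder/decoder pair."""
        new_res = self.output_resolution * factor
        # Assumption: freeze everything trained so far, so that only the
        # newly added stages are updated in the next training phase.
        for stage in self.stages():
            stage.trainable = False
        self.encoder.insert(0, Stage(f"enc_{new_res}", new_res))
        self.decoder.append(Stage(f"dec_{new_res}", new_res))

    def unfreeze_all(self) -> None:
        """Optional final phase: fine-tune the whole model end to end."""
        for stage in self.stages():
            stage.trainable = True


# Start from a pre-trained core at an assumed 16x16 resolution, grow three times.
model = GrowingUNet(core=Stage("core", 16))
for _ in range(3):
    model.grow()

names = [s.name for s in model.stages()]
trainable_now = [s.name for s in model.stages() if s.trainable]
print(model.output_resolution, names, trainable_now)
```

Each call to `grow` raises the output resolution by the chosen factor and leaves only the newest encoder/decoder pair marked trainable, which is one plausible reading of "greedy" growth; a real training loop would run an optimization phase between calls.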

Greedy Growing Enables High-Resolution Pixel-Based Diffusion Models

Noucsr: Efficient Super-Resolution Network Without Upsampling Convolution

Make a Cheap Scaling: A Self-Cascade Diffusion Model for Higher-Resolution Adaptation

DiffuseHigh: Training-free Progressive High-Resolution Image Synthesis through Structure Guidance

Single Remote Sensing Image Super-Resolution Via a Generative Adversarial Network with Stratified Dense Sampling and Chain Training

High-Resolution Image Editing via Multi-Stage Blended Diffusion

UltraPixel: Advancing Ultra-High-Resolution Image Synthesis to New Peaks

ResMaster: Mastering High-Resolution Image Generation via Structural and Fine-Grained Guidance

HiDiffusion: Unlocking Higher-Resolution Creativity and Efficiency in Pretrained Diffusion Models

simple diffusion: End-to-end diffusion for high resolution images

AsConvSR: Fast and Lightweight Super-Resolution Network with Assembled Convolutions

DeeDSR: Towards Real-World Image Super-Resolution via Degradation-Aware Stable Diffusion

Inf-DiT: Upsampling Any-Resolution Image with Memory-Efficient Diffusion Transformer

One-step Generative Diffusion for Realistic Extreme Image Rescaling

Improving the Stability and Efficiency of Diffusion Models for Content Consistent Super-Resolution

MegaFusion: Extend Diffusion Models towards Higher-resolution Image Generation without Further Tuning

On the Scalability of Diffusion-based Text-to-Image Generation

ACDMSR: Accelerated Conditional Diffusion Models for Single Image Super-Resolution

Alleviating Distortion in Image Generation via Multi-Resolution Diffusion Models and Time-Dependent Layer Normalization

Scalable High-Resolution Pixel-Space Image Synthesis with Hourglass Diffusion Transformers

Zoomed In, Diffused Out: Towards Local Degradation-Aware Multi-Diffusion for Extreme Image Super-Resolution