Greedy Growing Enables High-Resolution Pixel-Based Diffusion Models

Cristina N. Vasconcelos, Abdullah Rashwan, Austin Waters, Trevor Walker, Keyang Xu, Jimmy Yan, Rui Qian, Shixin Luo, Zarana Parekh, Andrew Bunner, Hongliang Fei, Roopal Garg, Mandy Guo, Ivana Kajic, Yeqing Li, Henna Nandwani, Jordi Pont-Tuset, Yasumasa Onoe, Sarah Rosston, Su Wang, Wenlei Zhou, Kevin Swersky, David J. Fleet, Jason M. Baldridge, Oliver Wang
2024-05-27
Abstract: We address the long-standing problem of how to learn effective pixel-based image diffusion models at scale, introducing a remarkably simple greedy growing method for stable training of large-scale, high-resolution models, without the need for cascaded super-resolution components. The key insight stems from careful pre-training of core components, namely, those responsible for text-to-image alignment vs. high-resolution rendering. We first demonstrate the benefits of scaling a Shallow UNet, with no down(up)-sampling enc(dec)oder. Scaling its deep core layers is shown to improve alignment, object structure, and composition. Building on this core model, we propose a greedy algorithm that grows the architecture into high-resolution end-to-end models, while preserving the integrity of the pre-trained representation, stabilizing training, and reducing the need for large high-resolution datasets. This enables a single-stage model capable of generating high-resolution images without the need for a super-resolution cascade. Our key results rely on public datasets and show that we are able to train non-cascaded models up to 8B parameters with no further regularization schemes. Vermeer, our full pipeline model trained with internal datasets to produce 1024x1024 images, without cascades, is preferred by 44.0% vs. 21.4% of human evaluators over SDXL.
Computer Vision and Pattern Recognition, Machine Learning
What problem does this paper attempt to address?
This paper addresses how to train large-scale pixel-based image diffusion models that generate high-resolution images. Such training is difficult because of unstable optimization, increased resource demands, and a lack of high-resolution training data. The paper proposes a simple greedy growing method that removes the need for cascaded super-resolution components. First, the researchers demonstrate the benefits of scaling a Shallow UNet, which has no downsampling encoder or upsampling decoder: scaling its deep core layers improves text-to-image alignment, object structure, and composition. Then, they propose a greedy algorithm that grows this core architecture into a high-resolution end-to-end model while preserving the integrity of the pre-trained representation, stabilizing training, and reducing reliance on large high-resolution datasets. The result is a single-stage model that generates high-resolution images without a super-resolution cascade. The key results rely on public datasets and show that non-cascaded models with up to 8 billion parameters can be trained without further regularization schemes. Vermeer, the full pipeline model trained on internal datasets to produce 1024×1024 images, is preferred by human evaluators over SDXL (44.0% vs. 21.4%). The paper also discusses limitations of cascaded models, such as the distribution shift between training and inference, which can amplify unnatural distortions produced by earlier stages of the cascade. Finally, the authors highlight potential applications of their method in image generation and in other downstream tasks (such as inverse problems and generation tasks), which often rely on diffusion models as image priors.
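
The greedy growing idea summarized above can be illustrated with a small bookkeeping sketch in plain Python. It is not the authors' implementation: the class names (`Block`, `GrowingUNet`), the core resolution of 16, the doubling of resolution per growth step, and in particular the choice to freeze all previously trained blocks when growing are assumptions made for illustration. The abstract only states that growing preserves the integrity of the pre-trained representation.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Block:
    """Hypothetical stand-in for a group of network layers."""
    name: str
    trainable: bool = True


@dataclass
class GrowingUNet:
    """Toy model of a core that is grown outward with encoder/decoder blocks."""
    core: List[Block]
    core_resolution: int  # resolution at which the core operates (assumed value)
    encoder: List[Block] = field(default_factory=list)  # outermost block first
    decoder: List[Block] = field(default_factory=list)  # outermost block last

    @property
    def output_resolution(self) -> int:
        # Assumption: each encoder/decoder pair changes resolution by a factor of 2.
        return self.core_resolution * 2 ** len(self.encoder)

    def grow(self, freeze_existing: bool = True) -> None:
        """Wrap the current model in one more downsampling/upsampling pair."""
        if freeze_existing:
            # Assumption: protect already-trained weights by freezing them.
            for block in self.core + self.encoder + self.decoder:
                block.trainable = False
        level = len(self.encoder) + 1
        self.encoder.insert(0, Block(f"down_{level}"))
        self.decoder.append(Block(f"up_{level}"))

    def trainable_names(self) -> List[str]:
        ordered = self.encoder + self.core + self.decoder
        return [b.name for b in ordered if b.trainable]


# Stage 1: a Shallow UNet is just the core, with no encoder or decoder.
model = GrowingUNet(core=[Block("core_0"), Block("core_1")], core_resolution=16)
stage1 = (model.output_resolution, model.trainable_names())

# Stage 2 and 3: grow greedily, training only the newly added blocks.
model.grow()
stage2 = (model.output_resolution, model.trainable_names())
model.grow()
stage3 = (model.output_resolution, model.trainable_names())

for stage in (stage1, stage2, stage3):
    print(stage)
```

In this sketch each growth step leaves the earlier blocks untouched and exposes only the new outer pair to training, which is one simple way to keep a pre-trained core intact; the paper may instead fine-tune some or all layers at later stages.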