MegaFusion: Extend Diffusion Models towards Higher-resolution Image Generation without Further Tuning

Haoning Wu, Shaocheng Shen, Qiang Hu, Xiaoyun Zhang, Ya Zhang, Yanfeng Wang
2024-08-21
Abstract: Diffusion models have emerged as frontrunners in text-to-image generation for their impressive capabilities. Nonetheless, their fixed image resolution during training often leads to challenges in high-resolution image generation, such as semantic inaccuracies and object replication. This paper introduces MegaFusion, a novel approach that extends existing diffusion-based text-to-image generation models towards efficient higher-resolution generation without additional fine-tuning or extra adaptation. Specifically, we employ an innovative truncate and relay strategy to bridge the denoising processes across different resolutions, allowing for high-resolution image generation in a coarse-to-fine manner. Moreover, by integrating dilated convolutions and noise re-scheduling, we further adapt the model's priors for higher resolution. The versatility and efficacy of MegaFusion make it universally applicable to both latent-space and pixel-space diffusion models, along with other derivative models. Extensive experiments confirm that MegaFusion significantly boosts the capability of existing models to produce images of megapixels and various aspect ratios, while only requiring about 40% of the original computational cost.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper aims to extend existing diffusion models to generate higher-resolution images without additional fine-tuning. Existing diffusion models are usually trained at a fixed image resolution, so they struggle when asked to generate images above that resolution, with problems such as semantic deviation and degraded image quality. The paper proposes a method named MegaFusion. It combines a truncate and relay strategy with dilated convolution and noise re-scheduling to generate high-resolution images efficiently and with high quality, while also supporting arbitrary aspect ratios.

### Main Contributions

1. **Truncate and Relay Strategy**: Connects the generation processes at different resolutions so that images are generated step by step from low resolution to high resolution, requiring only about 40% of the original computational cost.
2. **Dilated Convolution and Noise Re-scheduling**: Further adapt pre-trained diffusion models to high resolution, improving the quality of the generated images.
3. **Wide Applicability**: MegaFusion applies to diffusion models in both latent space and pixel space, as well as to other diffusion frameworks with additional conditions, such as IP-Adapter and ControlNet.
4. **Experimental Verification**: Extensive experiments verify the advantages of MegaFusion in efficiency, image quality, and semantic accuracy.

### Method Overview

- **Truncate and Relay Strategy**:
  - **Truncate**: Perform early denoising at low resolution to ensure accurate semantics.
  - **Relay**: Upsample the low-resolution image to high resolution and continue with late-stage denoising to generate texture details.
- **Dilated Convolution**: Expands the receptive field of the convolutional layers, enabling the model to capture more global information and reducing semantic deviation.
- **Noise Re-scheduling**: Adjusts the noise levels at different resolutions to be consistent with the noise level at the original resolution, improving the quality and fidelity of the generated images.

### Experimental Results

- **Quantitative Evaluation**: MegaFusion significantly outperforms the baseline models on multiple metrics, including image quality (FID, KID), semantic accuracy (CLIP-T, CIDEr, METEOR, ROUGE), and computational efficiency.
- **Human Evaluation**: Human evaluators rate the images generated by MegaFusion higher in both quality and semantic accuracy.

In conclusion, MegaFusion requires no additional fine-tuning, effectively extends the high-resolution generation ability of existing diffusion models, and has broad application prospects.
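
The coarse-to-fine control flow of the truncate and relay strategy described in the method overview can be sketched roughly as follows. This is a toy illustration in plain Python, not the authors' implementation: `toy_denoise_step`, `upsample_nearest`, `truncate_and_relay`, and `relative_cost` are hypothetical names, the denoiser is a placeholder for a pre-trained model, and the cost estimate assumes that per-step cost scales linearly with pixel count. The sketch omits dilated convolution and noise re-scheduling, and its numbers are not meant to reproduce the roughly 40% figure reported in the paper.

```python
import random


def toy_denoise_step(img):
    """Placeholder for one reverse-diffusion step (hypothetical).

    A real pipeline would call a pre-trained diffusion model here;
    this stand-in only shrinks the values so the loop is runnable.
    """
    return [[0.9 * v for v in row] for row in img]


def upsample_nearest(img, factor):
    """Nearest-neighbour upsampling of a 2D list by an integer factor."""
    out = []
    for row in img:
        wide = [v for v in row for _ in range(factor)]
        out.extend(list(wide) for _ in range(factor))
    return out


def truncate_and_relay(resolutions, steps_per_stage, seed=0):
    """Run early steps at low resolution, then relay to higher ones.

    resolutions: increasing square sizes, each a multiple of the previous.
    steps_per_stage: number of denoising steps spent at each resolution.
    """
    rng = random.Random(seed)
    size = resolutions[0]
    img = [[rng.gauss(0.0, 1.0) for _ in range(size)] for _ in range(size)]
    for res, n_steps in zip(resolutions, steps_per_stage):
        current = len(img)
        if res != current:
            if res % current != 0:
                raise ValueError("each resolution must be a multiple of the previous")
            # Relay: hand the truncated low-resolution result to the next stage.
            img = upsample_nearest(img, res // current)
        for _ in range(n_steps):
            img = toy_denoise_step(img)
    return img


def relative_cost(resolutions, steps_per_stage):
    """Cost relative to running every step at the final resolution.

    Assumption: one step costs an amount proportional to the pixel count.
    """
    staged = sum(n * r * r for r, n in zip(resolutions, steps_per_stage))
    full = sum(steps_per_stage) * resolutions[-1] ** 2
    return staged / full


if __name__ == "__main__":
    image = truncate_and_relay([8, 16, 32], [6, 2, 2])
    print(len(image), len(image[0]))
    print(relative_cost([8, 16, 32], [6, 2, 2]))
```

Spending most steps at the lowest resolution is what makes the staged schedule cheaper than running every step at the target resolution. The actual savings depend on the model architecture and on the schedule chosen.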