Abstract: Diffusion models have emerged as frontrunners in text-to-image generation for their impressive capabilities. Nonetheless, their fixed image resolution during training often leads to challenges in high-resolution image generation, such as semantic inaccuracies and object replication. This paper introduces MegaFusion, a novel approach that extends existing diffusion-based text-to-image generation models towards efficient higher-resolution generation without additional fine-tuning or extra adaptation. Specifically, we employ an innovative truncate and relay strategy to bridge the denoising processes across different resolutions, allowing for high-resolution image generation in a coarse-to-fine manner. Moreover, by integrating dilated convolutions and noise re-scheduling, we further adapt the model's priors for higher resolution. The versatility and efficacy of MegaFusion make it universally applicable to both latent-space and pixel-space diffusion models, along with other derivative models. Extensive experiments confirm that MegaFusion significantly boosts the capability of existing models to produce images of megapixels and various aspect ratios, while only requiring about 40% of the original computational cost.

What problem does this paper attempt to address?

This paper aims to extend existing diffusion models to generate higher-resolution images without additional fine-tuning. Existing diffusion models are usually trained at a fixed image resolution, so they struggle when asked to generate images above that resolution, showing problems such as semantic deviation and image quality degradation. The paper proposes a new method named MegaFusion. It uses a truncate and relay strategy, combined with dilated convolution and noise re-scheduling, to achieve efficient, high-quality high-resolution image generation while supporting arbitrary aspect ratios.

### Main Contributions

1. **Truncate and Relay Strategy**: By connecting the generation processes at different resolutions, it generates images step by step from low resolution to high resolution, requiring only about 40% of the original computational cost.
2. **Dilated Convolution and Noise Re-scheduling**: These further adapt pre-trained diffusion models to high resolution, improving the quality of the generated images.
3. **Wide Applicability**: MegaFusion applies to diffusion models in both latent space and pixel space, as well as to other diffusion frameworks with additional conditions, such as IP-Adapter and ControlNet.
4. **Experimental Verification**: Extensive experiments verify the advantages of MegaFusion in efficiency, image quality, and semantic accuracy.

### Method Overview

- **Truncate and Relay Strategy**:
  - **Truncate**: Perform early denoising at low resolution to ensure accurate semantics.
  - **Relay**: Upsample the low-resolution image to high resolution and continue with late-stage denoising to generate texture details.
- **Dilated Convolution**: Expand the receptive field of the convolutional layers, enabling the model to capture more global information and reduce semantic deviation.
- **Noise Re-scheduling**: Adjust the noise levels at different resolutions to be consistent with the noise level at the original resolution, improving the quality and fidelity of the generated images.

### Experimental Results

- **Quantitative Evaluation**: MegaFusion significantly outperforms the baseline models on multiple metrics, including image quality (FID, KID), semantic accuracy (CLIP-T, CIDEr, METEOR, ROUGE), and computational efficiency.
- **Human Evaluation**: In human evaluation, images generated by MegaFusion received higher scores for quality and semantic accuracy.

In conclusion, MegaFusion requires no additional fine-tuning, effectively extends the high-resolution image generation ability of existing diffusion models, and has broad application prospects.
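As a rough illustration of the truncate and relay strategy summarized under Method Overview, the sketch below runs the early denoising steps on a small grid, upsamples the intermediate result, and runs the remaining steps at the larger size. It is a minimal toy under stated assumptions, not the authors' implementation: `toy_denoise_step`, `upsample_nearest`, `truncate_and_relay`, and the `truncate_at` parameter are hypothetical names, the denoiser is a placeholder for a pre-trained model, and a single nearest-neighbour relay stands in for whatever upsampling and staging the paper actually uses. Dilated convolution and noise re-scheduling are not modeled here.

```python
import random
from typing import Callable, List

Image = List[List[float]]  # 2-D grid of floats standing in for a latent or image


def upsample_nearest(x: Image, factor: int) -> Image:
    """Nearest-neighbour upsampling: each value becomes a factor x factor block."""
    out: Image = []
    for row in x:
        wide = [v for v in row for _ in range(factor)]
        out.extend(list(wide) for _ in range(factor))
    return out


def toy_denoise_step(x: Image, t: int) -> Image:
    """Hypothetical stand-in for one denoising step of a pre-trained model.

    A real model would predict and remove noise conditioned on t and the
    text prompt; this placeholder just shrinks every value toward zero.
    """
    return [[0.9 * v for v in row] for row in x]


def truncate_and_relay(
    low_res_size: int,
    factor: int,
    total_steps: int,
    truncate_at: int,
    denoise_step: Callable[[Image, int], Image] = toy_denoise_step,
    seed: int = 0,
) -> Image:
    """Denoise early steps at low resolution, then relay to high resolution."""
    rng = random.Random(seed)
    # Start from Gaussian noise at the low (training) resolution.
    x = [[rng.gauss(0.0, 1.0) for _ in range(low_res_size)]
         for _ in range(low_res_size)]
    timesteps = list(range(total_steps, 0, -1))  # e.g. [T, T-1, ..., 1]

    # Truncate: only the first `truncate_at` steps run at low resolution.
    for t in timesteps[:truncate_at]:
        x = denoise_step(x, t)

    # Relay: upsample the intermediate result and finish at high resolution.
    x = upsample_nearest(x, factor)
    for t in timesteps[truncate_at:]:
        x = denoise_step(x, t)
    return x


if __name__ == "__main__":
    result = truncate_and_relay(low_res_size=4, factor=2,
                                total_steps=10, truncate_at=6)
    print(len(result), len(result[0]))
```

In this toy setup only the steps after the relay touch the larger grid, which is the intuition behind the reported cost saving; the actual figure of about 40% comes from the paper's experiments, not from this sketch.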

MegaFusion: Extend Diffusion Models towards Higher-resolution Image Generation without Further Tuning

DiffuseHigh: Training-free Progressive High-Resolution Image Synthesis through Structure Guidance

High-Resolution Image Editing via Multi-Stage Blended Diffusion

HiDiffusion: Unlocking Higher-Resolution Creativity and Efficiency in Pretrained Diffusion Models

MaxFusion: Plug&Play Multi-Modal Generation in Text-to-Image Diffusion Models

FAM Diffusion: Frequency and Attention Modulation for High-Resolution Image Generation with Stable Diffusion

Zoomed In, Diffused Out: Towards Local Degradation-Aware Multi-Diffusion for Extreme Image Super-Resolution

Efficient image generation with Contour Wavelet Diffusion

Diffusion Models Without Attention

ResMaster: Mastering High-Resolution Image Generation via Structural and Fine-Grained Guidance

Alleviating Distortion in Image Generation via Multi-Resolution Diffusion Models and Time-Dependent Layer Normalization

Matryoshka Diffusion Models

DiM: Diffusion Mamba for Efficient High-Resolution Image Synthesis

Make a Cheap Scaling: A Self-Cascade Diffusion Model for Higher-Resolution Adaptation

SnapFusion: Text-to-Image Diffusion Model on Mobile Devices within Two Seconds

ACDMSR: Accelerated Conditional Diffusion Models for Single Image Super-Resolution

VideoFusion: Decomposed Diffusion Models for High-Quality Video Generation

AccDiffusion v2: Towards More Accurate Higher-Resolution Diffusion Extrapolation

Greedy Growing Enables High-Resolution Pixel-Based Diffusion Models

BudgetFusion: Perceptually-Guided Adaptive Diffusion Models

Contour wavelet diffusion: A fast and high‐quality image generation model