Abstract: We present a one-shot text-to-image diffusion model that can generate high-resolution images from natural language descriptions. Our model employs a layered U-Net architecture that simultaneously synthesizes images at multiple resolution scales. We show that this method outperforms the baseline of synthesizing images only at the target resolution, while reducing the computational cost per step. We demonstrate that higher resolution synthesis can be achieved by layering convolutions at additional resolution scales, in contrast to other methods which require additional models for super-resolution synthesis.

What problem does this paper attempt to address?

The paper addresses the difficulty of generating high-resolution images with text-to-image diffusion models. Existing approaches typically add machinery to reach high resolution, such as learning in a low-dimensional latent space or increasing resolution gradually through cascaded models. These approaches are effective, but they increase model complexity and computational cost. The paper instead proposes a one-shot text-to-image diffusion model that generates high-resolution images from natural language descriptions with a single model, without separate super-resolution models. Its main contributions are:

1. **Layered Multi-Resolution Architecture**: The paper introduces a layered model based on the U-Net structure that synthesizes images at multiple resolution scales simultaneously. The paper reports that this outperforms a baseline that synthesizes only at the target resolution, while reducing the computational cost per step.
2. **Noise Scaling Technique**: The paper explores sinc interpolation formulas for scaling noise across resolution scales, so that the layers share the same noise signal while the noise at each scale remains Gaussian. This helps retain pixel information at different resolutions, which improves synthesis quality.
3. **Offset Cosine Schedule**: The paper uses an offset cosine schedule for the diffusion steps, so that the model synthesizes basic image features in the early stages and adds finer texture details in the later stages.
4. **Training Optimization**: The paper also explores training optimizations, such as strategic cropping and model stacking, to improve training efficiency and image quality.

Together, these contributions aim at a lightweight, efficient model that generates high-quality, high-resolution images at reduced computational cost.
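Two of the ideas summarized above, a cosine schedule with an offset and noise that stays Gaussian when shared across resolution scales, can be illustrated with a minimal sketch. It rests on assumptions: `cosine_alpha_bar` uses the widely known cosine schedule with a small offset `s` (default 0.008), which may differ from the offset the paper applies, and `downscale_noise` replaces the paper's sinc interpolation with a plain 2x2 box average, chosen only because it makes the variance correction easy to see. All function names are hypothetical.

```python
import math
import random
from typing import List


def cosine_alpha_bar(t: float, s: float = 0.008) -> float:
    """Cumulative signal level alpha_bar(t) for normalized time t in [0, 1].

    Standard cosine schedule with a small offset s; this is an assumption,
    the schedule used in the summarized paper may be defined differently.
    """
    f_t = math.cos((t + s) / (1.0 + s) * math.pi / 2.0) ** 2
    f_0 = math.cos(s / (1.0 + s) * math.pi / 2.0) ** 2
    return f_t / f_0


def downscale_noise(noise: List[List[float]]) -> List[List[float]]:
    """Halve the resolution of a unit-variance noise grid, keeping unit variance.

    Each 2x2 block is averaged (a box filter, NOT sinc interpolation) and the
    result is multiplied by 2, because the mean of four independent N(0, 1)
    samples has variance 1/4. Height and width must be even.
    """
    height, width = len(noise), len(noise[0])
    return [
        [
            2.0 * (noise[y][x] + noise[y][x + 1]
                   + noise[y + 1][x] + noise[y + 1][x + 1]) / 4.0
            for x in range(0, width, 2)
        ]
        for y in range(0, height, 2)
    ]


def variance(values: List[float]) -> float:
    """Population variance of a flat list of samples."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


if __name__ == "__main__":
    # Signal level falls from 1 at t=0 towards 0 at t=1.
    for step in range(5):
        t = step / 4
        print(f"t={t:.2f}  alpha_bar={cosine_alpha_bar(t):.4f}")

    # Shared noise: the coarse grid is derived from the fine one.
    rng = random.Random(0)
    fine = [[rng.gauss(0.0, 1.0) for _ in range(128)] for _ in range(128)]
    coarse = downscale_noise(fine)
    flat = [v for row in coarse for v in row]
    print(f"coarse grid: {len(coarse)}x{len(coarse[0])}, variance={variance(flat):.3f}")
```

Without the factor of 2, the coarse noise would have a standard deviation of 0.5 and would no longer match the noise level the schedule assumes at that step. Any interpolation kernel needs an analogous normalization to keep the downscaled noise at unit variance.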

Layered Diffusion Model for One-Shot High Resolution Text-to-Image Synthesis

Emage: Non-Autoregressive Text-to-Image Generation

DiffuseHigh: Training-free Progressive High-Resolution Image Synthesis through Structure Guidance

Layered Rendering Diffusion Model for Zero-Shot Guided Image Synthesis

SnapFusion: Text-to-Image Diffusion Model on Mobile Devices within Two Seconds

LayerDiff: Exploring Text-guided Multi-layered Composable Image Synthesis via Layer-Collaborative Diffusion Model

Layered Rendering Diffusion Model for Controllable Zero-Shot Image Synthesis

High-Resolution Image Synthesis with Latent Diffusion Models

HiDiffusion: Unlocking Higher-Resolution Creativity and Efficiency in Pretrained Diffusion Models

UDiffText: A Unified Framework for High-quality Text Synthesis in Arbitrary Images via Character-aware Diffusion Models

Greedy Growing Enables High-Resolution Pixel-Based Diffusion Models

SwiftBrush: One-Step Text-to-Image Diffusion Model with Variational Score Distillation

Meissonic: Revitalizing Masked Generative Transformers for Efficient High-Resolution Text-to-Image Synthesis

Training-free Diffusion Model Adaptation for Variable-Sized Text-to-Image Synthesis

Text-driven Visual Synthesis with Latent Diffusion Prior

Ultra-High-Resolution Image Synthesis with Pyramid Diffusion Model

Fast High-Resolution Image Synthesis with Latent Adversarial Diffusion Distillation

Cascaded Diffusion Models for High Fidelity Image Generation

Upscale-A-Video: Temporal-Consistent Diffusion Model for Real-World Video Super-Resolution

Matryoshka Diffusion Models

UltraPixel: Advancing Ultra-High-Resolution Image Synthesis to New Peaks