Layered Diffusion Model for One-Shot High Resolution Text-to-Image Synthesis

Emaad Khwaja, Abdullah Rashwan, Ting Chen, Oliver Wang, Suraj Kothawade, Yeqing Li
2024-07-09
Abstract: We present a one-shot text-to-image diffusion model that can generate high-resolution images from natural language descriptions. Our model employs a layered U-Net architecture that simultaneously synthesizes images at multiple resolution scales. We show that this method outperforms the baseline of synthesizing images only at the target resolution, while reducing the computational cost per step. We demonstrate that higher resolution synthesis can be achieved by layering convolutions at additional resolution scales, in contrast to other methods which require additional models for super-resolution synthesis.
Computer Vision and Pattern Recognition, Artificial Intelligence
What problem does this paper attempt to address?
The paper addresses the difficulty of high-resolution image generation in diffusion-based text-to-image synthesis. Existing methods typically reach high resolution either by learning in a low-dimensional latent space or by progressively increasing resolution through a cascade of models. Both approaches are effective, but they add model complexity and computational cost. The paper instead proposes a one-shot text-to-image diffusion model that generates high-resolution images from natural language descriptions with a single model, without separate super-resolution stages. The main contributions are:

1. **Multi-Resolution Layered Architecture**: A layered model based on the U-Net structure that synthesizes images at multiple resolution scales simultaneously. The paper reports that this design outperforms a baseline that synthesizes only at the target resolution, while reducing the computational cost per step.
2. **Noise Scaling Technique**: The paper explores sinc interpolation formulas for scaling noise across resolution scales, so that the layers share the same noise signal while the noise stays Gaussian. This helps retain pixel information at each resolution and improves synthesis quality.
3. **Cosine Schedule Offset**: The paper uses an offset cosine schedule for the diffusion steps, so that the model synthesizes basic image features in the early stages and adds finer texture details in the later stages.
4. **Training Optimization**: The paper also explores training optimizations, such as strategic cropping and model stacking, to improve training efficiency and image quality.

By combining these techniques, the paper aims for a lightweight, efficient model that generates high-quality, high-resolution images at reduced computational cost.
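To make the cosine schedule item more concrete, here is a minimal sketch of the widely used cosine noise schedule with a small offset `s`, as popularized for diffusion models. This is an assumption for illustration only: the summary does not give the paper's exact offset formulation, and the function names `cosine_alpha_bar` and `cosine_betas`, along with the defaults `s=0.008` and `max_beta=0.999`, are hypothetical choices rather than values taken from the paper.

```python
import math


def cosine_alpha_bar(t, s=0.008):
    """Cumulative signal level for normalized time t in [0, 1].

    Assumed form: cos^2(((t + s) / (1 + s)) * pi / 2), normalized so
    that the value at t = 0 is exactly 1. The offset s keeps the noise
    added in the very first steps from becoming vanishingly small.
    """
    f_t = math.cos((t + s) / (1 + s) * math.pi / 2) ** 2
    f_0 = math.cos(s / (1 + s) * math.pi / 2) ** 2
    return f_t / f_0


def cosine_betas(num_steps, s=0.008, max_beta=0.999):
    """Per-step noise variances derived from the cumulative schedule.

    Each beta is 1 - alpha_bar(next) / alpha_bar(prev), clipped at
    max_beta to avoid a degenerate final step.
    """
    betas = []
    for i in range(num_steps):
        a_prev = cosine_alpha_bar(i / num_steps, s)
        a_next = cosine_alpha_bar((i + 1) / num_steps, s)
        betas.append(min(1.0 - a_next / a_prev, max_beta))
    return betas


if __name__ == "__main__":
    # Print the remaining signal level at a few points in the schedule.
    for t in (0.0, 0.25, 0.5, 0.75, 1.0):
        print(f"t={t:.2f}  alpha_bar={cosine_alpha_bar(t):.4f}")
```

Under this schedule the signal level falls slowly at first and quickly near the end of the forward process. Shifting the schedule moves the point at which coarse structure gives way to fine detail, which is the kind of behavior the summary attributes to the offset.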