Abstract: We present the Hourglass Diffusion Transformer (HDiT), an image generative model that exhibits linear scaling with pixel count, supporting training at high-resolution (e.g. $1024 \times 1024$) directly in pixel-space. Building on the Transformer architecture, which is known to scale to billions of parameters, it bridges the gap between the efficiency of convolutional U-Nets and the scalability of Transformers. HDiT trains successfully without typical high-resolution training techniques such as multiscale architectures, latent autoencoders or self-conditioning. We demonstrate that HDiT performs competitively with existing models on ImageNet $256^2$, and sets a new state-of-the-art for diffusion models on FFHQ-$1024^2$.

What problem does this paper attempt to address?

The paper addresses the efficiency and quality of high-resolution image generation, in particular the difficulty of using diffusion models for image synthesis directly in pixel space. The authors propose the Hourglass Diffusion Transformer (HDiT), a pure Transformer architecture designed for efficient, high-quality image generation at large image sizes.

### Research Objectives

1. **Develop an efficient Transformer architecture**: Computational cost rises rapidly when Transformer-based diffusion models process high-resolution images, because global self-attention scales quadratically with the number of tokens. The paper proposes a new Transformer architecture, the Hourglass Diffusion Transformer (HDiT), whose cost grows linearly with pixel count.
2. **Achieve high-quality image generation**: Through a series of architectural improvements, the model generates high-resolution images directly in pixel space without sacrificing image quality.

### Main Contributions

1. **Explore how to adapt the Transformer architecture**: The paper discusses in detail how to adapt Transformer-based diffusion models to generate high-quality images efficiently in pixel space.
2. **Introduce the HDiT architecture**: The proposed Hourglass Diffusion Transformer has sub-quadratic growth in computational cost and maintains good performance at high resolutions.
3. **Demonstrate the model's scalability**: Experiments show that the architecture scales to high resolutions such as $1024 \times 1024$ without high-resolution-specific training techniques, and remains competitive at lower resolutions.
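The difference between quadratic and linear scaling can be made concrete with a rough count of query-key pairs. The sketch below is an illustration only: the function names, the patch size of 4, and the neighborhood size of 7 are assumptions chosen for the example, not values taken from this summary, and the count ignores everything in the model except attention.

```python
def token_count(height: int, width: int, patch: int = 4) -> int:
    """Number of tokens after splitting the image into patch x patch blocks."""
    return (height // patch) * (width // patch)


def global_attention_pairs(height: int, width: int, patch: int = 4) -> int:
    """Every token attends to every token: cost grows with tokens squared."""
    n = token_count(height, width, patch)
    return n * n


def local_attention_pairs(height: int, width: int, kernel: int = 7, patch: int = 4) -> int:
    """Every token attends to a fixed kernel x kernel window: cost grows with tokens."""
    n = token_count(height, width, patch)
    return n * kernel * kernel


if __name__ == "__main__":
    for side in (256, 512, 1024):
        print(side, global_attention_pairs(side, side), local_attention_pairs(side, side))
```

Under these assumptions, doubling the image side quadruples the token count, which multiplies the global count by 16 but the local count by only 4.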
### Method Overview

- **Utilize the hierarchical structure of images**: The hourglass structure lets the model process image information at several resolutions, which improves the quality of generated images while keeping computation efficient.
- **Local attention mechanism**: To reduce computational complexity, the model uses local attention (such as Neighborhood Attention) instead of global self-attention, lowering cost while retaining sufficient expressive power.
- **Other improvements**: These include Rotary Positional Encoding, cosine-similarity attention, and adaptive normalization layers.

Together, these methods make high-resolution image generation more efficient while keeping image quality high: per the abstract, HDiT is competitive with existing models on ImageNet $256^2$ and sets a new state of the art for diffusion models on FFHQ-$1024^2$.
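To illustrate how local attention and cosine-similarity attention combine, here is a minimal single-head NumPy sketch. It is not the authors' implementation: the function name, the fixed `scale` (cosine-similarity attention typically uses a learned scale), and the border handling (the window is shifted inward so that every query sees exactly `kernel * kernel` keys) are assumptions made for this example, and rotary position encoding is omitted.

```python
import numpy as np


def neighborhood_cosine_attention(q, k, v, kernel=3, scale=10.0):
    """Local attention over an (H, W, D) grid using cosine-similarity logits.

    Hypothetical helper for illustration; `scale` is fixed here as an assumption.
    """
    h, w, d = q.shape
    if kernel % 2 != 1 or kernel > min(h, w):
        raise ValueError("kernel must be odd and no larger than the grid")

    # Cosine similarity = dot product of L2-normalized queries and keys.
    qn = q / (np.linalg.norm(q, axis=-1, keepdims=True) + 1e-8)
    kn = k / (np.linalg.norm(k, axis=-1, keepdims=True) + 1e-8)

    out = np.zeros((h, w, v.shape[-1]))
    r = kernel // 2
    for i in range(h):
        # Shift the window inward at the borders so it always has `kernel` rows.
        top = min(max(i - r, 0), h - kernel)
        for j in range(w):
            left = min(max(j - r, 0), w - kernel)
            keys = kn[top:top + kernel, left:left + kernel].reshape(-1, d)
            vals = v[top:top + kernel, left:left + kernel].reshape(kernel * kernel, -1)
            logits = scale * (keys @ qn[i, j])
            weights = np.exp(logits - logits.max())  # numerically stable softmax
            weights /= weights.sum()
            out[i, j] = weights @ vals
    return out
```

Each query touches a constant number of keys, so the total work is proportional to the number of tokens. A production implementation would use a fused kernel rather than Python loops.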

Scalable High-Resolution Pixel-Space Image Synthesis with Hourglass Diffusion Transformers
