Scalable High-Resolution Pixel-Space Image Synthesis with Hourglass Diffusion Transformers

Katherine Crowson, Stefan Andreas Baumann, Alex Birch, Tanishq Mathew Abraham, Daniel Z. Kaplan, Enrico Shippole
2024-01-22
Abstract: We present the Hourglass Diffusion Transformer (HDiT), an image generative model that exhibits linear scaling with pixel count, supporting training at high-resolution (e.g. $1024 \times 1024$) directly in pixel-space. Building on the Transformer architecture, which is known to scale to billions of parameters, it bridges the gap between the efficiency of convolutional U-Nets and the scalability of Transformers. HDiT trains successfully without typical high-resolution training techniques such as multiscale architectures, latent autoencoders or self-conditioning. We demonstrate that HDiT performs competitively with existing models on ImageNet $256^2$, and sets a new state-of-the-art for diffusion models on FFHQ-$1024^2$.
Computer Vision and Pattern Recognition, Artificial Intelligence, Machine Learning
What problem does this paper attempt to address?
The paper addresses the efficiency and quality problems of high-resolution image generation, in particular the cost of running diffusion models directly in pixel space. It proposes the Hourglass Diffusion Transformer (HDiT), a pure Transformer architecture designed for efficient, high-quality image generation at large resolutions.

### Research Objectives

1. **Develop an efficient Transformer architecture**: The computational cost of Transformer-based diffusion models rises rapidly with image resolution. The paper proposes a new Transformer architecture, the Hourglass Diffusion Transformer (HDiT), whose cost grows linearly with pixel count rather than quadratically.
2. **Achieve high-quality image generation**: Through a series of architectural improvements, the model generates high-resolution images directly in pixel space without sacrificing image quality.

### Main Contributions

1. **Explore how to adapt the Transformer architecture**: The paper discusses in detail how to adjust Transformer-based diffusion models to generate high-quality images efficiently in pixel space.
2. **Introduce the HDiT architecture**: The proposed Hourglass Diffusion Transformer keeps the growth in computational cost sub-quadratic and maintains good performance at high resolutions.
3. **Demonstrate the model's scalability**: Experiments show that the architecture scales to resolutions such as $1024 \times 1024$ without high-resolution-specific training techniques and remains competitive at lower resolutions.
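The difference between quadratic and linear cost growth can be illustrated with a simple count of query-key pairs. The sketch below is only a back-of-the-envelope illustration, not the paper's cost model: `PATCH_SIZE` and `KERNEL_SIZE` are hypothetical values chosen for the example, and the count covers a single attention layer at a single resolution level, ignoring the hierarchy and all other layers.

```python
# Illustrative count of attention query-key pairs (not the paper's cost model).
# PATCH_SIZE and KERNEL_SIZE are hypothetical values chosen for this example.
PATCH_SIZE = 4
KERNEL_SIZE = 7


def token_count(resolution: int, patch_size: int) -> int:
    """Number of tokens for a square image split into square patches."""
    side = resolution // patch_size
    return side * side


def global_attention_pairs(num_tokens: int) -> int:
    """Every token attends to every token: quadratic in the token count."""
    return num_tokens * num_tokens


def local_attention_pairs(num_tokens: int, kernel_size: int) -> int:
    """Every token attends to a fixed-size window: linear in the token count."""
    return num_tokens * kernel_size * kernel_size


def cost_table(resolutions=(256, 512, 1024)):
    """Return (resolution, tokens, global pairs, local pairs) for each resolution."""
    rows = []
    for res in resolutions:
        n = token_count(res, PATCH_SIZE)
        rows.append(
            (res, n, global_attention_pairs(n), local_attention_pairs(n, KERNEL_SIZE))
        )
    return rows


if __name__ == "__main__":
    for res, n, global_pairs, local_pairs in cost_table():
        print(f"{res}x{res}: tokens={n} global={global_pairs} local={local_pairs}")
```

Under these assumptions, doubling the side length quadruples the token count, which multiplies the global pair count by 16 but the local pair count by only 4.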
### Method Overview

- **Utilize the hierarchical structure of images**: The hourglass structure lets the model process image information at several resolutions, improving the quality of generated images while keeping computation efficient.
- **Local attention mechanism**: To reduce computational complexity, the model uses local attention (such as Neighborhood Attention) in place of global self-attention, which lowers cost while retaining sufficient expressive power.
- **Other improvements**: These include Rotary Positional Encoding, cosine similarity attention, and adaptive normalization layers, which further improve model performance.

Together, these methods make high-resolution image generation more efficient while keeping quality high: according to the abstract, HDiT is competitive with existing models on ImageNet $256^2$ and sets a new state of the art for diffusion models on FFHQ-$1024^2$.
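To make the local and cosine similarity attention ideas above more concrete, here is a minimal pure-Python sketch for a single query on a small grid. It is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the window is shifted inward at image borders so that each query always sees `kernel_size**2` keys, and `scale` is a fixed hypothetical value standing in for a scale that a real model would more likely learn. Values, multiple heads, and positional encoding are omitted.

```python
import math


def neighborhood(row, col, height, width, kernel_size):
    """Positions of the kernel_size x kernel_size window for one query.

    The window is centered on the query where possible and shifted inward at
    the borders, so every query gets exactly kernel_size**2 keys.
    Assumes an odd kernel_size no larger than min(height, width).
    """
    half = kernel_size // 2
    top = min(max(row - half, 0), height - kernel_size)
    left = min(max(col - half, 0), width - kernel_size)
    return [
        (r, c)
        for r in range(top, top + kernel_size)
        for c in range(left, left + kernel_size)
    ]


def l2_normalize(vec, eps=1e-6):
    """Scale a vector to unit length (guarding against division by zero)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / max(norm, eps) for x in vec]


def cosine_attention_weights(query, keys, scale=10.0):
    """Softmax over scaled cosine similarities between one query and its keys.

    `scale` is a hypothetical fixed value used only for this sketch.
    """
    q = l2_normalize(query)
    logits = [
        scale * sum(a * b for a, b in zip(q, l2_normalize(k))) for k in keys
    ]
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    height = width = 8
    # Toy 2-dimensional feature per grid position, made up for the demo.
    features = [[[r + 1.0, c + 1.0] for c in range(width)] for r in range(height)]
    window = neighborhood(0, 0, height, width, kernel_size=3)
    keys = [features[r][c] for r, c in window]
    weights = cosine_attention_weights(features[0][0], keys)
    for pos, weight in zip(window, weights):
        print(pos, round(weight, 4))
```

Because queries and keys are normalized before the dot product, the logits are bounded by `scale` regardless of feature magnitude, and each query touches a constant number of keys regardless of image size.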