Scaling Rectified Flow Transformers for High-Resolution Image Synthesis

Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, Dustin Podell, Tim Dockhorn, Zion English, Kyle Lacey, Alex Goodwin, Yannik Marek, Robin Rombach

2024-03-06

Abstract: Diffusion models create data from noise by inverting the forward paths of data towards noise and have emerged as a powerful generative modeling technique for high-dimensional, perceptual data such as images and videos. Rectified flow is a recent generative model formulation that connects data and noise in a straight line. Despite its better theoretical properties and conceptual simplicity, it is not yet decisively established as standard practice. In this work, we improve existing noise sampling techniques for training rectified flow models by biasing them towards perceptually relevant scales. Through a large-scale study, we demonstrate the superior performance of this approach compared to established diffusion formulations for high-resolution text-to-image synthesis. Additionally, we present a novel transformer-based architecture for text-to-image generation that uses separate weights for the two modalities and enables a bidirectional flow of information between image and text tokens, improving text comprehension, typography, and human preference ratings. We demonstrate that this architecture follows predictable scaling trends and correlates lower validation loss to improved text-to-image synthesis as measured by various metrics and human evaluations. Our largest models outperform state-of-the-art models, and we will make our experimental data, code, and model weights publicly available.

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

### Problems the Paper Attempts to Solve

The paper aims to address several key issues in high-resolution image generation:

1. **Improving the Rectified Flow Model**:
   - Proposes an improved noise sampling technique for training rectified flow models, biasing the sampled timesteps towards perceptually relevant scales, and demonstrates its superior performance in high-resolution text-to-image synthesis through a large-scale study.
   - With the modified timestep sampling, the rectified flow model outperforms established diffusion formulations (such as LDM-Linear).
2. **Introducing a New Transformer Architecture**:
   - Designs a new Transformer architecture for text-to-image generation. It uses separate weights for the image and text modalities and allows bidirectional information flow between image and text tokens, thereby improving text comprehension, typography, and human preference ratings.
3. **Validating Model Scalability**:
   - Demonstrates that this architecture follows predictable scaling trends and that lower validation loss correlates strongly with improved text-to-image synthesis performance.
   - The largest model outperforms current state-of-the-art models (such as SDXL and DALL-E 3) in various quantitative evaluation metrics and human assessments.

The core contribution of the paper lies in improving the training method for rectified flow models and proposing a new architecture design, so that the resulting models perform better in high-resolution image generation tasks.
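The first point above can be made concrete with a small sketch. It is not the authors' code: it assumes the common rectified flow convention in which a noisy sample is the straight-line mix `z_t = (1 - t) * x0 + t * noise` and the regression target is the constant velocity `noise - x0`. It also assumes a logit-normal timestep sampler as one way to favour intermediate noise levels. All function names and the `mean`/`std` parameters are hypothetical and chosen for illustration only.

```python
import math
import random


def sample_t_logit_normal(rng, mean=0.0, std=1.0):
    """Draw t in (0, 1) by passing a normal sample through a sigmoid.

    Assumption: this is one possible non-uniform sampler that puts more
    mass on intermediate timesteps than uniform sampling does.
    """
    u = rng.gauss(mean, std)
    return 1.0 / (1.0 + math.exp(-u))


def interpolate(x0, noise, t):
    """Straight-line path between data and noise: (1 - t) * x0 + t * noise."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, noise)]


def velocity_target(x0, noise):
    """Velocity along the straight path, constant in t: noise - x0."""
    return [b - a for a, b in zip(x0, noise)]


def training_example(x0, rng):
    """Build one (t, z_t, target) triple for a velocity-prediction model."""
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    t = sample_t_logit_normal(rng)
    return t, interpolate(x0, noise, t), velocity_target(x0, noise)


def mse(pred, target):
    """Mean squared error between predicted and target velocities."""
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(target)


if __name__ == "__main__":
    rng = random.Random(0)
    x0 = [0.5, -1.0, 2.0]  # stand-in for a flattened image latent
    t, z_t, v = training_example(x0, rng)
    # A real model would predict v from (z_t, t, text); zeros are a placeholder.
    print(t, z_t, mse([0.0] * len(v), v))
```

In a real training loop the placeholder prediction would come from the network, conditioned on `z_t`, `t`, and the text embedding. Swapping `sample_t_logit_normal` for `rng.random()` recovers uniform timestep sampling, which makes the effect of the sampler easy to compare.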

Text-to-Image Rectified Flow as Plug-and-Play Priors

Boosting Latent Diffusion with Flow Matching

FlowDreamer: Exploring High Fidelity Text-to-3D Generation via Rectified Flow

Stable Flow: Vital Layers for Training-Free Image Editing

Efficient Scaling of Diffusion Transformers for Text-to-Image Generation

I-Max: Maximize the Resolution Potential of Pre-trained Rectified Flow Transformers with Projected Flow

Jet: A Modern Transformer-Based Normalizing Flow

InstaFlow: One Step is Enough for High-Quality Diffusion-Based Text-to-Image Generation

Diffscaler: Enhancing the Generative Prowess of Diffusion Transformers

Improving the Training of Rectified Flows

Taming Rectified Flow for Inversion and Editing

Unveil Inversion and Invariance in Flow Transformer for Versatile Image Editing

Switti: Designing Scale-Wise Transformers for Text-to-Image Synthesis

DART: Denoising Autoregressive Transformer for Scalable Text-to-Image Generation

Diffusion-RWKV: Scaling RWKV-Like Architectures for Diffusion Models

Towards Precise Scaling Laws for Video Diffusion Transformers

Scaling Laws For Diffusion Transformers

Semantic Image Inversion and Editing using Rectified Stochastic Differential Equations

On the Scalability of Diffusion-based Text-to-Image Generation

Fluid: Scaling Autoregressive Text-to-image Generative Models with Continuous Tokens