Abstract: Denoising with a Joint-Embedding Predictive Architecture (D-JEPA), an autoregressive model, has demonstrated outstanding performance in class-conditional image generation. However, the application of next-token prediction in high-resolution text-to-image generation remains underexplored. In this paper, we introduce D-JEPA·T2I, an extension of D-JEPA incorporating flow matching loss, designed to enable data-efficient continuous resolution learning. D-JEPA·T2I leverages a multimodal visual transformer to effectively integrate textual and visual features and adopts Visual Rotary Positional Embedding (VoPE) to facilitate continuous resolution learning. Furthermore, we devise a data feedback mechanism that significantly enhances data utilization efficiency. For the first time, we achieve state-of-the-art high-resolution image synthesis via next-token prediction. The experimental code and pretrained models will be open-sourced at https://d-jepa.github.io/t2i.

What problem does this paper attempt to address?

This paper addresses the shortcomings of existing autoregressive models in high-resolution text-to-image generation. Specifically:

1. **Limitations of existing models**:
   - Autoregressive models perform poorly in high-resolution image generation, especially in terms of image texture and overall quality.
   - Existing models struggle with images of arbitrary resolutions and aspect ratios, mainly because they lack an appropriate visual position encoding.
2. **Improving the performance of autoregressive models**:
   - Introduce a multimodal visual transformer to integrate text and visual features more effectively.
   - Use flow matching loss instead of the original diffusion loss to achieve faster convergence and greater flexibility.
   - Introduce Visual Rotary Positional Embedding (VoPE) to keep position information consistent across resolutions.
3. **Improving data utilization efficiency**:
   - Propose a data feedback mechanism that dynamically adjusts the training data distribution and optimizes data usage efficiency, especially for the long-tail data problem in large-scale datasets.

Through these improvements, the D-JEPA·T2I model can achieve state-of-the-art performance in high-resolution text-to-image generation tasks, especially for high-fidelity, high-resolution images.

### Formula summary

- **Flow Matching Loss**:
\[ L_{\text{flow}}(x_i, z_i)=\int_{0}^{1} \mathbb{E}\left[\left\|v_\theta(x_t^i, t, z_i)-(x_i - \epsilon)\right\|^{2}\right]dt \]
where $v_\theta(x_t^i, t, z_i)$ is the predicted time-dependent velocity field. It is regressed onto the target velocity of the interpolation path $x_t^i=\alpha_t x_i+\beta_t \epsilon$:
\[ v_t(x_t^i, z_i)=\dot{\alpha}_t x_i+\dot{\beta}_t \epsilon=x_i - \epsilon \]
- **Visual Rotary Positional Embedding (VoPE)**:
\[ \left\langle f_q\left(x_m, \tfrac{1}{\rho}(m + b)\right), f_k\left(x_n, \tfrac{1}{\rho}(n + b)\right)\right\rangle=g\left(x_m, x_n, \tfrac{1}{\rho}(m - n)\right) \]
where $\rho$ is the resolution density and $b$ is the relative position offset.
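The VoPE identity above can be checked numerically on a toy case. The following is only a minimal sketch, not the paper's implementation: it assumes a single 2-D feature pair stored as a complex number and a single rotation frequency `theta`, and the function names `rotate` and `attention_score` are hypothetical. It shows that the score depends only on the scaled relative position $(m - n)/\rho$, so the offset $b$ cancels and doubling the resolution density together with the token indices leaves the score unchanged.

```python
import cmath


def rotate(feature: complex, position: float, rho: float,
           b: float = 0.0, theta: float = 1.0) -> complex:
    """Rotate a 2-D feature (stored as a complex number) by the
    density-scaled position (position + b) / rho.

    theta is an assumed single rotation frequency; a real rotary
    embedding would use one frequency per feature pair.
    """
    angle = theta * (position + b) / rho
    return feature * cmath.exp(1j * angle)


def attention_score(q: complex, k: complex, m: float, n: float,
                    rho: float, b: float = 0.0, theta: float = 1.0) -> float:
    """Inner product of the rotated query (position m) and key (position n).

    For 2-D vectors stored as complex numbers, the real inner product
    equals Re(q * conj(k)).
    """
    q_rot = rotate(q, m, rho, b, theta)
    k_rot = rotate(k, n, rho, b, theta)
    return (q_rot * k_rot.conjugate()).real


if __name__ == "__main__":
    q, k = complex(1.0, 2.0), complex(0.5, -1.0)
    # Same relative layout at two resolution densities, with different offsets.
    low_res = attention_score(q, k, m=3, n=1, rho=1.0, b=0.0)
    high_res = attention_score(q, k, m=6, n=2, rho=2.0, b=5.0)
    print(low_res, high_res)
```

In this toy setting the two printed scores agree up to floating-point error, which is the property the formula expresses: position information stays consistent when the same content is sampled at a different resolution.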
These formulas show how the model improves the quality and efficiency of high-resolution image generation through flow matching loss and visual rotary positional embedding.
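As a rough illustration of the flow matching loss in the formula summary, the integral over $t$ and the expectation over $\epsilon$ can be approximated by Monte Carlo sampling. The sketch below rests on stated assumptions and is not the authors' training code: it assumes the linear path $x_t = t\,x + (1 - t)\,\epsilon$ (consistent with the target $x - \epsilon$), Gaussian noise, and a plain Python list for a token; `flow_matching_loss`, `interpolate`, and `velocity_fn` are hypothetical names, and `velocity_fn` stands in for the network $v_\theta$.

```python
import random


def interpolate(x, eps, t):
    """Assumed linear path x_t = t * x + (1 - t) * eps."""
    return [t * xi + (1.0 - t) * ei for xi, ei in zip(x, eps)]


def flow_matching_loss(velocity_fn, x, z, num_samples=256, rng=None):
    """Monte Carlo estimate of the flow matching loss for one token x.

    velocity_fn(x_t, t, z) is a stand-in for the network v_theta;
    z is the conditioning feature and is passed through untouched.
    """
    if rng is None:
        rng = random.Random(0)  # fixed seed so the sketch is repeatable
    total = 0.0
    for _ in range(num_samples):
        t = rng.random()                          # t ~ U[0, 1)
        eps = [rng.gauss(0.0, 1.0) for _ in x]    # eps ~ N(0, I)
        x_t = interpolate(x, eps, t)
        target = [xi - ei for xi, ei in zip(x, eps)]  # target velocity x - eps
        pred = velocity_fn(x_t, t, z)
        total += sum((p - g) ** 2 for p, g in zip(pred, target))
    return total / num_samples


if __name__ == "__main__":
    x = [1.0, -2.0]

    def zero_velocity(x_t, t, z):
        # A trivial predictor that always outputs zero velocity.
        return [0.0 for _ in x_t]

    print(flow_matching_loss(zero_velocity, x, z=None))
```

A predictor that recovers $x - \epsilon$ exactly drives this estimate to zero, while the zero predictor above leaves a strictly positive loss, which is what the regression objective penalizes.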

High-Resolution Image Synthesis via Next-Token Prediction
