High-Resolution Image Synthesis via Next-Token Prediction

Dengsheng Chen, Jie Hu, Tiezhu Yue, Xiaoming Wei
2024-11-22
Abstract: Denoising with a Joint-Embedding Predictive Architecture (D-JEPA), an autoregressive model, has demonstrated outstanding performance in class-conditional image generation. However, the application of next-token prediction in high-resolution text-to-image generation remains underexplored. In this paper, we introduce D-JEPA·T2I, an extension of D-JEPA incorporating flow matching loss, designed to enable data-efficient continuous resolution learning. D-JEPA·T2I leverages a multimodal visual transformer to effectively integrate textual and visual features and adopts Visual Rotary Positional Embedding (VoPE) to facilitate continuous resolution learning. Furthermore, we devise a data feedback mechanism that significantly enhances data utilization efficiency. For the first time, we achieve state-of-the-art **high-resolution** image synthesis via next-token prediction. The experimental code and pretrained models will be open-sourced at https://d-jepa.github.io/t2i.
Computer Vision and Pattern Recognition, Artificial Intelligence, Machine Learning
What problem does this paper attempt to address?
The main problem this paper addresses is the weakness of existing autoregressive models in high-resolution text-to-image generation. Specifically:

1. **Limitations of existing models**:
   - Autoregressive models perform poorly in high-resolution image generation, especially in terms of image texture and overall quality.
   - Existing models struggle with images of arbitrary resolutions and aspect ratios, mainly because they lack an appropriate visual position encoding.
2. **Improving the performance of autoregressive models**:
   - Introduce a multimodal visual transformer to integrate text and visual features more effectively.
   - Use a flow matching loss instead of the original diffusion loss to achieve faster convergence and greater flexibility.
   - Introduce Visual Rotary Positional Embedding (VoPE) to keep position information consistent across resolutions.
3. **Improving data utilization efficiency**:
   - Propose a data feedback mechanism that dynamically adjusts the training data distribution to optimize data usage, especially for the long-tail data problem in large-scale datasets.

With these improvements, D-JEPA·T2I achieves state-of-the-art performance in high-resolution text-to-image generation, especially for high-fidelity, high-resolution images.

### Formula summary

- **Flow Matching Loss**:
  \[ L_{\text{flow}}(x_i, z_i)=\int_{0}^{1} \mathbb{E}\left[\left\|v_\theta(x_t^i, t, z_i)-(x_i - \epsilon)\right\|^{2}\right]dt \]
  where \(v_\theta(x_t^i, t, z_i)\) is the velocity predicted by the model for the noised token \(x_t^i\) at time \(t\), conditioned on \(z_i\), and \(\epsilon\) is the noise sample. The regression target \(x_i - \epsilon\) is the time-dependent velocity field of an interpolation path of the form \(x_t^i=\alpha_t x_i+\beta_t \epsilon\):
  \[ v_t(x_t^i, z_i)=\dot{\alpha}_t x_i+\dot{\beta}_t \epsilon=x_i - \epsilon \]
  The last equality holds when \(\dot{\alpha}_t=1\) and \(\dot{\beta}_t=-1\), for example with the linear schedule \(\alpha_t=t\), \(\beta_t=1-t\).
- **Visual Rotary Positional Embedding (VoPE)**:
  \[ \langle f_q(x_m, \frac{1}{\rho}(m + b)), f_k(x_n, \frac{1}{\rho}(n + b))\rangle=g(x_m, x_n, \frac{1}{\rho}(m - n)) \]
  where \(m\) and \(n\) are token positions, \(\rho\) is the resolution density, and \(b\) is the relative position offset. The attention score therefore depends only on the density-scaled relative offset \(\frac{1}{\rho}(m - n)\); the offset \(b\) cancels.
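The VoPE identity above can be checked numerically. The following is a minimal sketch, not the authors' implementation: it assumes a standard one-dimensional rotary embedding (the paper's version operates on image grids), and the names `rotate`, `vope_score`, and the frequency base `10000.0` are illustrative choices that do not come from the source.

```python
import numpy as np

def rotate(x, pos, base=10000.0):
    """Apply a 1-D rotary embedding to vector x at a (possibly fractional) position."""
    d = x.shape[-1]
    # One rotation frequency per pair of channels, as in standard RoPE (assumed here).
    freqs = base ** (-np.arange(0, d, 2) / d)
    angles = pos * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x_even * cos - x_odd * sin
    out[..., 1::2] = x_even * sin + x_odd * cos
    return out

def vope_score(q, k, m, n, rho, b):
    """Inner product of a query at index m and a key at index n,
    with positions shifted by b and scaled by the resolution density rho."""
    return float(np.dot(rotate(q, (m + b) / rho), rotate(k, (n + b) / rho)))

rng = np.random.default_rng(0)
q, k = rng.standard_normal(8), rng.standard_normal(8)

# Both calls share the same scaled relative offset (m - n) / rho = 3,
# but use different grid densities and different offsets b.
score_low = vope_score(q, k, m=4, n=1, rho=1.0, b=0.0)
score_high = vope_score(q, k, m=8, n=2, rho=2.0, b=5.0)
print(score_low, score_high)
```

In this sketch the two scores agree up to floating-point error, which illustrates why scaling positions by the resolution density lets the same relative geometry be reused when the token grid becomes denser.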
Together, these two formulas capture the paper's main technical changes: the flow matching loss replaces the diffusion loss for faster convergence, and VoPE keeps positional information consistent across resolutions and aspect ratios.
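The flow matching loss in the formula summary can be estimated by Monte Carlo sampling over time and noise. The sketch below assumes the linear path x_t = t·x + (1 − t)·ε with Gaussian noise, which is one schedule consistent with the target x − ε; the function name `flow_matching_loss`, the toy velocity models `oracle` and `zero`, and the toy condition `z` are hypothetical and stand in for the paper's network and features.

```python
import numpy as np

def flow_matching_loss(v_theta, x, z, rng, n_samples=1024):
    """Monte Carlo estimate of the flow matching loss for one token x with condition z.

    v_theta is any callable (x_t, t, z) -> predicted velocity with the shape of x.
    """
    total = 0.0
    for _ in range(n_samples):
        t = rng.uniform(0.0, 1.0)          # time sampled from [0, 1)
        eps = rng.standard_normal(x.shape)  # Gaussian noise (assumed)
        x_t = t * x + (1.0 - t) * eps       # assumed linear interpolation path
        target = x - eps                    # velocity of that path
        total += np.sum((v_theta(x_t, t, z) - target) ** 2)
    return total / n_samples

x = np.array([1.0, -2.0, 0.5])
z = x.copy()  # toy condition that fully determines the token (demo assumption)

def oracle(x_t, t, z):
    # If the clean token is known, the target x - eps equals (x - x_t) / (1 - t).
    return (z - x_t) / (1.0 - t)

def zero(x_t, t, z):
    # Baseline that always predicts zero velocity.
    return np.zeros_like(x_t)

rng = np.random.default_rng(0)
loss_oracle = flow_matching_loss(oracle, x, z, rng)
loss_zero = flow_matching_loss(zero, x, z, rng)
print(loss_oracle, loss_zero)
```

The oracle drives the loss to roughly zero while the zero-velocity baseline does not, which is the behaviour a trained velocity network is meant to approach. In practice the condition would not reveal the token exactly, so the achievable loss is higher.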