Abstract: Conventional wisdom holds that autoregressive models for image generation are typically accompanied by vector-quantized tokens. We observe that while a discrete-valued space can facilitate representing a categorical distribution, it is not a necessity for autoregressive modeling. In this work, we propose to model the per-token probability distribution using a diffusion procedure, which allows us to apply autoregressive models in a continuous-valued space. Rather than using categorical cross-entropy loss, we define a Diffusion Loss function to model the per-token probability. This approach eliminates the need for discrete-valued tokenizers. We evaluate its effectiveness across a wide range of cases, including standard autoregressive models and generalized masked autoregressive (MAR) variants. By removing vector quantization, our image generator achieves strong results while enjoying the speed advantage of sequence modeling. We hope this work will motivate the use of autoregressive generation in other continuous-valued domains and applications. Code is available at: https://github.com/LTH14/mar.

What problem does this paper attempt to address?

This paper asks whether autoregressive models for image generation must be paired with vector-quantized representations. The authors observe that while a discrete-valued space conveniently represents categorical distributions, it is not a requirement for autoregressive modeling. They therefore propose modeling the probability distribution of each token in a continuous-valued space with a diffusion process, which removes the need for discrete-valued tokenizers. The main points are:

1. **Problem Background**: Traditional autoregressive models typically rely on discrete-valued tokenizers (such as VQ-VAE), which convert images into sequences of discrete tokens through vector quantization. This approach is widely adopted in image generation, but it has issues such as training difficulty and sensitivity to gradient approximation strategies.
2. **Research Motivation**: The authors argue that the core of autoregressive modeling is "predicting the next token based on previous tokens," which does not depend on whether the token values are discrete or continuous. The key question is how to model the probability distribution of each token.
3. **Solution**: The authors propose using a diffusion process to model the probability distribution of each token. This lets autoregressive models operate in continuous-valued spaces without a discrete-valued tokenizer.
4. **Technical Details**: The authors define a Diffusion Loss function that replaces the traditional categorical cross-entropy loss. A small MLP network predicts noise vectors, and this noise prediction is what models the probability distribution of each token.
5. **Experimental Results**: The authors evaluate the method on standard autoregressive models and on generalized masked autoregressive (MAR) models. The results show that the method achieves strong generation quality while keeping the speed advantage of sequence modeling.

In summary, this paper explores whether autoregressive image generation can drop its dependence on discrete-valued tokenizers, and it proposes a diffusion-based method to achieve this.
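To make the diffusion loss described above more concrete, the following is a minimal sketch in plain Python, not the authors' implementation (see the linked repository for that). It assumes a standard DDPM-style noise-prediction objective: a continuous token `x` is corrupted with Gaussian noise at a random timestep, and a small MLP conditioned on a vector `z` (standing in for the autoregressive model's output for that position) is scored on how well it predicts the noise. The names `TinyNoiseMLP`, `cosine_alpha_bar`, and `diffusion_loss`, the simplified cosine schedule, and all dimensions are hypothetical choices made for illustration.

```python
import math
import random


def cosine_alpha_bar(t, num_steps):
    """Fraction of signal kept at step t (simplified cosine schedule, an assumption)."""
    return math.cos(0.5 * math.pi * t / num_steps) ** 2


class TinyNoiseMLP:
    """Hypothetical two-layer MLP that predicts the noise added to one token."""

    def __init__(self, token_dim, cond_dim, hidden_dim, rng):
        # Input is the noisy token, the condition vector z, and a scalar timestep.
        in_dim = token_dim + cond_dim + 1
        self.w1 = [[rng.gauss(0.0, 1.0 / math.sqrt(in_dim)) for _ in range(in_dim)]
                   for _ in range(hidden_dim)]
        self.w2 = [[rng.gauss(0.0, 1.0 / math.sqrt(hidden_dim)) for _ in range(hidden_dim)]
                   for _ in range(token_dim)]

    def __call__(self, x_t, t_frac, z):
        inp = list(x_t) + list(z) + [t_frac]
        # Hidden layer with ReLU activation.
        hidden = [max(0.0, sum(w * v for w, v in zip(row, inp))) for row in self.w1]
        # Linear output layer: one predicted noise value per token dimension.
        return [sum(w * v for w, v in zip(row, hidden)) for row in self.w2]


def diffusion_loss(model, x, z, num_steps, rng):
    """Mean squared error between the true and predicted noise for one token."""
    t = rng.randint(1, num_steps)                 # random timestep in [1, num_steps]
    a_bar = cosine_alpha_bar(t, num_steps)
    eps = [rng.gauss(0.0, 1.0) for _ in x]        # noise the MLP has to recover
    # Forward diffusion: x_t = sqrt(a_bar) * x + sqrt(1 - a_bar) * eps
    x_t = [math.sqrt(a_bar) * xi + math.sqrt(1.0 - a_bar) * ei
           for xi, ei in zip(x, eps)]
    eps_pred = model(x_t, t / num_steps, z)
    return sum((e - p) ** 2 for e, p in zip(eps, eps_pred)) / len(x)


rng = random.Random(0)
model = TinyNoiseMLP(token_dim=4, cond_dim=8, hidden_dim=16, rng=rng)
x = [rng.gauss(0.0, 1.0) for _ in range(4)]   # a continuous-valued token
z = [rng.gauss(0.0, 1.0) for _ in range(8)]   # stand-in for the autoregressive output
loss = diffusion_loss(model, x, z, num_steps=1000, rng=rng)
print(loss)
```

The sketch only evaluates the loss for a single token with untrained weights. In an actual training setup one would presumably use an automatic differentiation framework, average over many tokens and timesteps, and update the MLP by gradient descent.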

Autoregressive Image Generation without Vector Quantization

Emage: Non-Autoregressive Text-to-Image Generation

Not All Image Regions Matter: Masked Vector Quantization for Autoregressive Image Generation

Towards Accurate Image Coding: Improved Autoregressive Image Generation with Dynamic Vector Quantization

Autoregressive Video Generation without Vector Quantization

E-CAR: Efficient Continuous Autoregressive Image Generation via Multistage Modeling

ControlAR: Controllable Image Generation with Autoregressive Models

Vector Quantized Diffusion Model for Text-to-Image Synthesis

Regularized Vector Quantization for Tokenized Image Synthesis

Efficient Generative Modeling with Residual Vector Quantization-Based Tokens

Stabilize the Latent Space for Image Autoregressive Modeling: A Unified Perspective

Randomized Autoregressive Visual Generation

Revolutionizing Text-to-Image Retrieval as Autoregressive Token-to-Voken Generation

A Spark of Vision-Language Intelligence: 2-Dimensional Autoregressive Transformer for Efficient Finegrained Image Generation

Autoregressive Image Generation using Residual Quantization

Taming Scalable Visual Tokenizer for Autoregressive Image Generation

Self-control: A Better Conditional Mechanism for Masked Autoregressive Model

Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation

MoVQ: Modulating Quantized Vectors for High-Fidelity Image Generation

XQ-GAN: An Open-source Image Tokenization Framework for Autoregressive Generation

DART: Denoising Autoregressive Transformer for Scalable Text-to-Image Generation