Autoregressive Image Generation without Vector Quantization

Tianhong Li, Yonglong Tian, He Li, Mingyang Deng, Kaiming He
2024-11-01
Abstract: Conventional wisdom holds that autoregressive models for image generation are typically accompanied by vector-quantized tokens. We observe that while a discrete-valued space can facilitate representing a categorical distribution, it is not a necessity for autoregressive modeling. In this work, we propose to model the per-token probability distribution using a diffusion procedure, which allows us to apply autoregressive models in a continuous-valued space. Rather than using categorical cross-entropy loss, we define a Diffusion Loss function to model the per-token probability. This approach eliminates the need for discrete-valued tokenizers. We evaluate its effectiveness across a wide range of cases, including standard autoregressive models and generalized masked autoregressive (MAR) variants. By removing vector quantization, our image generator achieves strong results while enjoying the speed advantage of sequence modeling. We hope this work will motivate the use of autoregressive generation in other continuous-valued domains and applications. Code is available at: https://github.com/LTH14/mar
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The question this paper addresses is whether autoregressive models for image generation must be paired with vector-quantized representations. The authors observe that although a discrete-valued space makes it convenient to represent a categorical distribution, it is not a necessary condition for autoregressive modeling. They therefore propose modeling each token's probability distribution in a continuous-valued space with a diffusion process, which removes the need for a discrete-valued tokenizer. The main points are:

1. **Problem Background**: Traditional autoregressive models typically rely on discrete-valued tokenizers (such as VQ-VAE), which convert images into sequences of discrete tokens through vector quantization. This approach is widely adopted in image generation, but such tokenizers are difficult to train and are sensitive to the gradient approximation strategy used.
2. **Research Motivation**: The authors argue that the core of autoregressive modeling is "predicting the next token based on previous tokens," which does not depend on whether the token values are discrete or continuous. What matters is how the probability distribution of each token is modeled.
3. **Solution**: The authors use a diffusion process to model the probability distribution of each token. This lets autoregressive models operate in a continuous-valued space without a discrete-valued tokenizer.
4. **Technical Details**: The authors define a Diffusion Loss function that replaces the traditional categorical cross-entropy loss. A small MLP predicts the noise vector added to each token, and this noise-prediction objective models the per-token probability distribution.
5. **Experimental Results**: The authors validate the method in multiple experiments, covering standard autoregressive models and generalized masked autoregressive (MAR) models. The experimental results show that the method improves generation quality and also generates faster.

In summary, this paper explores whether autoregressive image generation can drop its dependence on discrete-valued tokenizers, and it proposes a diffusion-based method to achieve this goal.
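To make the Technical Details point concrete, the following is a minimal sketch of a noise-prediction loss of the kind described: a small MLP receives a noised token, a timestep, and a conditioning vector from the autoregressive model, and is scored on how well it predicts the added noise. It is not the authors' implementation. The dimensions, the linear noise schedule, the scalar timestep feature, and every name (`predict_noise`, `diffusion_loss`, `alpha_bar`, and so on) are assumptions made for illustration, and the MLP weights are random rather than trained.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration.
TOKEN_DIM = 16   # dimension of one continuous-valued token x
COND_DIM = 32    # dimension of the conditioning vector z from the autoregressive model
HIDDEN = 64      # width of the small MLP
T = 1000         # number of diffusion timesteps

# Assumed linear beta schedule; alpha_bar[t] is the cumulative product of (1 - beta).
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)

# Randomly initialised MLP weights (untrained; a real model would learn these).
W1 = rng.normal(0.0, 0.02, (TOKEN_DIM + COND_DIM + 1, HIDDEN))
b1 = np.zeros(HIDDEN)
W2 = rng.normal(0.0, 0.02, (HIDDEN, TOKEN_DIM))
b2 = np.zeros(TOKEN_DIM)


def predict_noise(x_t, t, z):
    """Small MLP that predicts the noise in x_t given timestep t and condition z."""
    t_feat = (t / T)[:, None]                     # crude scalar timestep feature
    h = np.concatenate([x_t, z, t_feat], axis=1)
    h = np.maximum(h @ W1 + b1, 0.0)              # ReLU hidden layer
    return h @ W2 + b2


def diffusion_loss(x, z, rng):
    """Mean squared error between the sampled noise and the predicted noise."""
    n = x.shape[0]
    t = rng.integers(0, T, size=n)                # one random timestep per token
    eps = rng.standard_normal(x.shape)            # Gaussian noise to be predicted
    a = alpha_bar[t][:, None]
    x_t = np.sqrt(a) * x + np.sqrt(1.0 - a) * eps  # noised version of the token
    return float(np.mean((eps - predict_noise(x_t, t, z)) ** 2))


# Stand-in data: 8 ground-truth tokens and 8 matching conditioning vectors.
x = rng.standard_normal((8, TOKEN_DIM))
z = rng.standard_normal((8, COND_DIM))
loss = diffusion_loss(x, z, rng)
print(loss)
```

In a training setup, a loss like this would take the place of the categorical cross-entropy on each predicted token, with gradients flowing both into the MLP and back into the autoregressive model that produces `z`.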