Randomized Autoregressive Visual Generation

Qihang Yu, Ju He, Xueqing Deng, Xiaohui Shen, Liang-Chieh Chen
2024-11-02
Abstract: This paper presents Randomized AutoRegressive modeling (RAR) for visual generation, which sets a new state-of-the-art performance on the image generation task while maintaining full compatibility with language modeling frameworks. The proposed RAR is simple: during a standard autoregressive training process with a next-token prediction objective, the input sequence, typically ordered in raster form, is randomly permuted into different factorization orders with a probability r, where r starts at 1 and linearly decays to 0 over the course of training. This annealing training strategy enables the model to learn to maximize the expected likelihood over all factorization orders and thus effectively improve the model's capability of modeling bidirectional contexts. Importantly, RAR preserves the integrity of the autoregressive modeling framework, ensuring full compatibility with language modeling while significantly improving performance in image generation. On the ImageNet-256 benchmark, RAR achieves an FID score of 1.48, not only surpassing prior state-of-the-art autoregressive image generators but also outperforming leading diffusion-based and masked transformer-based methods. Code and models will be made available at https://github.com/bytedance/1d-tokenizer
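The annealed permutation scheme described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names (`permutation_probability`, `sample_order`, `make_training_pair`) are hypothetical, and the sketch assumes the linear decay of r spans the entire training run, which the abstract suggests but does not spell out in detail.

```python
import random


def permutation_probability(step, total_steps):
    """Linearly decay r from 1.0 at the first step to 0.0 at the last step.

    Assumption: the decay covers the whole run; the real schedule may differ.
    """
    if total_steps <= 1:
        return 0.0
    r = 1.0 - step / (total_steps - 1)
    return min(1.0, max(0.0, r))


def sample_order(num_tokens, r, rng):
    """With probability r return a random permutation, otherwise raster order."""
    order = list(range(num_tokens))
    if rng.random() < r:
        rng.shuffle(order)
    return order


def make_training_pair(tokens, order):
    """Reorder the tokens and build next-token-prediction inputs and targets."""
    permuted = [tokens[i] for i in order]
    return permuted[:-1], permuted[1:]


if __name__ == "__main__":
    rng = random.Random(0)
    tokens = list(range(100, 116))  # stand-in for 16 image token ids
    total_steps = 5
    for step in range(total_steps):
        r = permutation_probability(step, total_steps)
        order = sample_order(len(tokens), r, rng)
        inputs, targets = make_training_pair(tokens, order)
        print(step, round(r, 2), order[:4])
```

Because the last training step always uses raster order (r = 0), the model in this sketch ends training on the same ordering it would use at inference time, while the early steps expose it to many other factorization orders.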
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
### The Problem the Paper Attempts to Solve

This paper attempts to address the issue that autoregressive (AR) models perform worse in image generation tasks compared to diffusion models or non-autoregressive transformers. Specifically, traditional autoregressive models struggle to effectively capture bidirectional context due to their unidirectional dependency when processing visual signals, which limits their performance in image generation tasks. The paper proposes a new method, Randomized AutoRegressive modeling (RAR), which aims to enhance the bidirectional context learning ability of autoregressive models by introducing a randomized permutation training strategy, thereby improving the quality of image generation.

### Key Points Summary

1. **Problem Background**:
   - Autoregressive models perform well in natural language processing tasks but usually underperform in image generation tasks compared to diffusion models or non-autoregressive transformers.
   - Visual data has bidirectional correlations, while traditional autoregressive models rely on causal attention masking, leading to unidirectional information flow, which is not suitable for visual data.
2. **Solution**:
   - **Randomized AutoRegressive modeling (RAR)**: By randomly permuting the input sequence during training, the model can learn all possible factorization orders, thereby maximizing the expected likelihood.
   - **Randomness Annealing**: Using high-probability random permutations in the early stages of training, gradually transitioning to the standard raster scan order to balance bidirectional context learning and generation quality.
   - **Target-aware Positional Embedding**: Introducing additional positional embeddings so that the model can recognize the position of the target token when predicting the next token, avoiding prediction confusion due to different permutations.
3. **Experimental Results**:
   - On the ImageNet-256 benchmark, RAR achieved an FID score of 1.48, significantly outperforming previous autoregressive image generators and other leading diffusion models and masked transformers.
   - Different model variants of RAR (ranging from 261M parameters to 1.5B parameters) showed good scalability, with performance continuously improving as the model size increased.

### Conclusion

By introducing a randomized permutation training strategy and target-aware positional embeddings, RAR not only retains the core structure of autoregressive models but also significantly improves the quality of image generation, reaching a new state-of-the-art level in image generation tasks. This method opens up new directions for research in autoregressive visual generation and is expected to drive further development in this field.
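To illustrate why the target-aware positional embedding mentioned in the summary matters, here is a small sketch under stated assumptions. The helper names (`add_vectors`, `build_inputs`) and the choice to sum the three embeddings are hypothetical and chosen only for illustration; the summary does not specify how RAR combines them. The point it shows is that, under random orders, the same input token may be followed by different target positions, and an extra embedding indexed by the target position lets the input tell those cases apart.

```python
def add_vectors(*vectors):
    """Element-wise sum of equally sized vectors given as lists of floats."""
    return [sum(values) for values in zip(*vectors)]


def build_inputs(token_emb, pos_table, target_table, order):
    """Build one input vector per prediction step of a permuted sequence.

    token_emb[p]    : content embedding of the token at raster position p
    pos_table[p]    : ordinary positional embedding of position p
    target_table[p] : extra embedding marking p as the position to predict next
    order           : factorization order (a permutation of raster positions)
    """
    inputs = []
    for i in range(len(order) - 1):
        current_pos, next_pos = order[i], order[i + 1]
        inputs.append(
            add_vectors(
                token_emb[current_pos],
                pos_table[current_pos],
                target_table[next_pos],  # tells the model which position comes next
            )
        )
    return inputs


if __name__ == "__main__":
    # Toy 1-dimensional embeddings for three positions (illustrative values only).
    token_emb = [[0.0], [0.0], [0.0]]
    pos_table = [[1.0], [2.0], [3.0]]
    target_table = [[10.0], [20.0], [30.0]]

    # Both orders start at position 2 but predict different positions next.
    print(build_inputs(token_emb, pos_table, target_table, [2, 0, 1]))
    print(build_inputs(token_emb, pos_table, target_table, [2, 1, 0]))
```

Without the `target_table` term, the first input vector would be identical for both orders in the usage example, even though the model is asked to predict a different position in each case.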