Abstract: This paper presents Randomized AutoRegressive modeling (RAR) for visual generation, which sets a new state of the art on the image generation task while maintaining full compatibility with language modeling frameworks. The proposed RAR is simple: during a standard autoregressive training process with a next-token prediction objective, the input sequence (typically ordered in raster form) is randomly permuted into different factorization orders with a probability r, where r starts at 1 and linearly decays to 0 over the course of training. This annealing training strategy enables the model to learn to maximize the expected likelihood over all factorization orders and thus effectively improves the model's capability of modeling bidirectional contexts. Importantly, RAR preserves the integrity of the autoregressive modeling framework, ensuring full compatibility with language modeling while significantly improving performance in image generation. On the ImageNet-256 benchmark, RAR achieves an FID score of 1.48, not only surpassing prior state-of-the-art autoregressive image generators but also outperforming leading diffusion-based and masked transformer-based methods. Code and models will be made available at https://github.com/bytedance/1d-tokenizer
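The annealing schedule described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the function names (`permutation_probability`, `order_for_step`, `permute`) are hypothetical, and it assumes that r decays linearly across the whole run, from 1 at step 0 to 0 at the final step, which is the simplest reading of the abstract.

```python
import random


def permutation_probability(step: int, total_steps: int) -> float:
    """Assumed schedule: r = 1 at step 0, decaying linearly to 0 at total_steps."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    return max(0.0, 1.0 - step / total_steps)


def order_for_step(num_tokens: int, r: float, rng: random.Random) -> list[int]:
    """Pick a factorization order: random with probability r, raster otherwise."""
    order = list(range(num_tokens))  # raster order 0, 1, ..., num_tokens - 1
    if rng.random() < r:
        rng.shuffle(order)  # uniformly random permutation, in place
    return order


def permute(tokens: list, order: list[int]) -> list:
    """Reorder raster-ordered tokens so that they follow the chosen order."""
    return [tokens[i] for i in order]


if __name__ == "__main__":
    rng = random.Random(0)
    tokens = ["t0", "t1", "t2", "t3"]
    total_steps = 10
    for step in (0, 5, 10):
        r = permutation_probability(step, total_steps)
        order = order_for_step(len(tokens), r, rng)
        print(f"step={step} r={r:.2f} sequence={permute(tokens, order)}")
```

Under this assumed schedule, early steps almost always train on a shuffled sequence, while the final step always uses the raster order, so the model ends training on the same order it uses for generation.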

What problem does this paper attempt to address?

### The Problem the Paper Attempts to Solve

This paper attempts to address the issue that autoregressive (AR) models perform worse in image generation tasks compared to diffusion models or non-autoregressive transformers. Specifically, traditional autoregressive models struggle to effectively capture bidirectional context due to their unidirectional dependency when processing visual signals, which limits their performance in image generation tasks. The paper proposes a new method, Randomized AutoRegressive modeling (RAR), which aims to enhance the bidirectional context learning ability of autoregressive models by introducing a randomized permutation training strategy, thereby improving the quality of image generation.

### Key Points Summary

1. **Problem Background**:
   - Autoregressive models perform well in natural language processing tasks but usually underperform in image generation tasks compared to diffusion models or non-autoregressive transformers.
   - Visual data has bidirectional correlations, while traditional autoregressive models rely on causal attention masking, leading to unidirectional information flow, which is not suitable for visual data.
2. **Solution**:
   - **Randomized AutoRegressive modeling (RAR)**: By randomly permuting the input sequence during training, the model can learn all possible factorization orders, thereby maximizing the expected likelihood.
   - **Randomness Annealing**: Using high-probability random permutations in the early stages of training, gradually transitioning to the standard raster scan order to balance bidirectional context learning and generation quality.
   - **Target-aware Positional Embedding**: Introducing additional positional embeddings so that the model can recognize the position of the target token when predicting the next token, avoiding prediction confusion due to different permutations.
3. **Experimental Results**:
   - On the ImageNet-256 benchmark, RAR achieved an FID score of 1.48, significantly outperforming previous autoregressive image generators and other leading diffusion models and masked transformers.
   - Different model variants of RAR (ranging from 261M parameters to 1.5B parameters) showed good scalability, with performance continuously improving as the model size increased.

### Conclusion

By introducing a randomized permutation training strategy and target-aware positional embeddings, RAR not only retains the core structure of autoregressive models but also significantly improves the quality of image generation, reaching a new state-of-the-art level in image generation tasks. This method opens up new directions for research in autoregressive visual generation and is expected to drive further development in this field.
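To make the target-aware positional embedding from the Solution list more concrete, the sketch below shows only the bookkeeping it relies on: for each input token in a permuted sequence, which image position the model is being asked to predict next. The helper `training_pairs` and its dictionary keys are hypothetical names chosen for illustration. How the target position is turned into an embedding and combined with the input is not specified in the summary above, so that step is left out.

```python
def training_pairs(tokens: list[int], order: list[int]) -> list[dict]:
    """Pair each input token with its next-token target under a given order.

    tokens: token ids in raster order.
    order:  a permutation of range(len(tokens)) giving the factorization order.
    """
    permuted = [tokens[i] for i in order]
    pairs = []
    for step in range(len(order) - 1):
        pairs.append({
            "input_token": permuted[step],
            "input_position": order[step],       # raster position of the input token
            "target_position": order[step + 1],  # raster position of the token to predict
            "target_token": permuted[step + 1],
        })
    return pairs


if __name__ == "__main__":
    tokens = [10, 11, 12, 13]  # made-up token ids, raster order
    order = [2, 0, 3, 1]       # one possible random factorization order
    for pair in training_pairs(tokens, order):
        print(pair)
```

In raster order the target position is always the input position plus one, so it carries no extra information. Under a random order the same input token can be followed by any remaining position, which is the ambiguity the additional embedding is meant to remove.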

Randomized Autoregressive Visual Generation

Emage: Non-Autoregressive Text-to-Image Generation

RandAR: Decoder-only Autoregressive Visual Generation in Random Orders

Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction

Autoregressive Image Generation without Vector Quantization

A Spark of Vision-Language Intelligence: 2-Dimensional Autoregressive Transformer for Efficient Finegrained Image Generation

HART: Efficient Visual Generation with Hybrid Autoregressive Transformer

M-VAR: Decoupled Scale-wise Autoregressive Modeling for High-Quality Image Generation

Denoising Autoregressive Representation Learning

ControlAR: Controllable Image Generation with Autoregressive Models

Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation

Open-MAGVIT2: An Open-Source Project Toward Democratizing Auto-regressive Visual Generation

Customize Your Visual Autoregressive Recipe with Set Autoregressive Modeling

CART: Compositional Auto-Regressive Transformer for Image Generation

Revolutionizing Text-to-Image Retrieval as Autoregressive Token-to-Voken Generation

DART: Denoising Autoregressive Transformer for Scalable Text-to-Image Generation

Infinity: Scaling Bitwise AutoRegressive Modeling for High-Resolution Image Synthesis

Collaborative Decoding Makes Visual Auto-Regressive Modeling Efficient

Taming Scalable Visual Tokenizer for Autoregressive Image Generation

LARP: Tokenizing Videos with a Learned Autoregressive Generative Prior

XQ-GAN: An Open-source Image Tokenization Framework for Autoregressive Generation