Customize Your Visual Autoregressive Recipe with Set Autoregressive Modeling

Wenze Liu, Le Zhuo, Yi Xin, Sheng Xia, Peng Gao, Xiangyu Yue
2024-10-14
Abstract: We introduce a new paradigm for AutoRegressive (AR) image generation, termed Set AutoRegressive Modeling (SAR). SAR generalizes the conventional AR to the next-set setting, i.e., splitting the sequence into arbitrary sets containing multiple tokens, rather than outputting each token in a fixed raster order. To accommodate SAR, we develop a straightforward architecture termed Fully Masked Transformer. We reveal that existing AR variants correspond to specific design choices of sequence order and output intervals within the SAR framework, with AR and Masked AR (MAR) as two extreme instances. Notably, SAR facilitates a seamless transition from AR to MAR, where intermediate states allow for training a causal model that benefits from both few-step inference and KV cache acceleration, thus leveraging the advantages of both AR and MAR. On the ImageNet benchmark, we carefully explore the properties of SAR by analyzing the impact of sequence order and output intervals on performance, as well as the generalization ability regarding inference order and steps. We further validate the potential of SAR by training a 900M text-to-image model capable of synthesizing photo-realistic images with any resolution. We hope our work may inspire more exploration and application of AR-based modeling across diverse modalities.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses a bottleneck of existing autoregressive (AR) image generation models: they require a large number of inference steps to generate high-quality images. To overcome this, the authors propose a new autoregressive paradigm, Set Autoregressive Modeling (SAR), which makes model design more flexible by generalizing two choices: the sequence order and the output intervals. SAR not only unifies existing AR variants but also provides a smooth transition path from conventional AR to Masked Autoregressive (MAR) modeling, thereby combining the advantages of AR and MAR, such as fewer inference steps, KV cache acceleration, and image editing.

Specifically, the main contributions of the paper are:

1. **Proposing Set Autoregressive Modeling (SAR)**: SAR extends conventional AR models to a more general setting in which sequence order and output intervals can be configured arbitrarily, introducing new model states that combine the advantages of AR and MAR.
2. **Designing the Fully Masked Transformer (FMT)**: FMT is a new model architecture capable of causal learning under arbitrary sequence order and output intervals, supporting training and inference under the SAR framework.
3. **Extensive Experimental Validation**: The authors conduct extensive experiments on the ImageNet benchmark, exploring the performance characteristics of SAR and its potential applications in text-to-image generation and editing tasks.

Through these contributions, the paper aims to provide new ideas and methods for autoregressive image generation models, with the goal of improving generation quality and efficiency.
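To make the two design choices concrete, the following is a minimal sketch of how a sequence order and a list of set sizes (the "output intervals") determine the sets that are generated step by step, and of a generic block-causal attention mask over those sets. It is an illustration under stated assumptions, not the authors' implementation: the function names `split_into_sets` and `block_causal_mask`, the 8-token sequence, and the set sizes `[1, 3, 4]` are all hypothetical, and the mask shown is a plain block-causal mask rather than the Fully Masked Transformer's actual masking scheme.

```python
import random
from typing import List, Sequence


def split_into_sets(order: Sequence[int], set_sizes: Sequence[int]) -> List[List[int]]:
    """Split token positions, visited in `order`, into consecutive sets.

    Hypothetical helper: `order` plays the role of the sequence order and
    `set_sizes` the role of the output intervals.
    """
    if sum(set_sizes) != len(order):
        raise ValueError("set sizes must sum to the sequence length")
    if any(size <= 0 for size in set_sizes):
        raise ValueError("every set must contain at least one token")
    sets, start = [], 0
    for size in set_sizes:
        sets.append(list(order[start:start + size]))
        start += size
    return sets


def block_causal_mask(sets: Sequence[Sequence[int]]) -> List[List[bool]]:
    """Return mask[q][k], True when position q may attend to position k.

    Assumption (not taken from the paper): a position sees its own set and
    every earlier set, which is the usual block-causal pattern.
    """
    n = sum(len(members) for members in sets)
    set_index = [0] * n
    for idx, members in enumerate(sets):
        for pos in members:
            set_index[pos] = idx
    return [[set_index[k] <= set_index[q] for k in range(n)] for q in range(n)]


num_tokens = 8
raster = list(range(num_tokens))

# Conventional AR: raster order, one token per set, so one step per token.
ar_sets = split_into_sets(raster, [1] * num_tokens)

# An intermediate recipe: shuffled order and larger sets, so only three steps.
rng = random.Random(0)
shuffled = raster[:]
rng.shuffle(shuffled)
few_step_sets = split_into_sets(shuffled, [1, 3, 4])

ar_mask = block_causal_mask(ar_sets)
few_step_mask = block_causal_mask(few_step_sets)
```

With one token per set in raster order, the mask reduces to the familiar lower-triangular causal mask, while the three-set recipe needs only three generation steps for the same eight tokens. Because attention in this sketch never looks at later sets, keys and values of finished sets could in principle be cached, which is the kind of few-step plus KV-cache trade-off the summary describes.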