A Survey on Vision Autoregressive Model

Kai Jiang, Jiaxing Huang
2024-11-13
Abstract: Autoregressive models have demonstrated great performance in natural language processing (NLP) with impressive scalability, adaptability and generalizability. Inspired by their notable success in the NLP field, autoregressive models have recently been intensively investigated for computer vision. They perform next-token prediction by representing visual data as visual tokens, which enables autoregressive modelling for a wide range of vision tasks, ranging from visual generation and visual understanding to the very recent multimodal generation that unifies visual generation and understanding with a single autoregressive model. This paper provides a systematic review of vision autoregressive models. It develops a taxonomy of existing methods and highlights their major contributions, strengths, and limitations, covering various vision tasks such as image generation, video generation, image editing, motion generation, medical image analysis, 3D generation, robotic manipulation, unified multimodal generation, etc. In addition, we investigate and analyze the latest advancements in autoregressive models, including thorough benchmarking and discussion of existing methods across various evaluation datasets. Finally, we outline key challenges and promising directions for future research, offering a roadmap to guide further advancements in vision autoregressive models.
Computer Vision and Pattern Recognition, Artificial Intelligence
What problem does this paper attempt to address?
Autoregressive (AR) models have achieved significant success in Natural Language Processing (NLP), demonstrating excellent scalability, adaptability, and generalization. Inspired by this, researchers have recently begun applying autoregressive models to computer vision tasks such as image generation, video generation, image editing, motion generation, medical image analysis, 3D generation, robotic manipulation, and multimodal generation. However, the field currently lacks a systematic review of these vision autoregressive models, which makes it difficult to fully understand existing methods, challenges, and future research directions. The paper aims to fill this gap through three contributions:

1. **Systematic Review**: Provide an overview of how vision autoregressive models are applied across tasks such as image generation and image understanding, and develop a taxonomy of existing methods that highlights their main contributions, strengths, and limitations.
2. **Analysis of Recent Advances**: Study and analyze the latest advances in autoregressive models, including benchmarking and discussion of existing methods across various evaluation datasets.
3. **Future Research Directions**: Identify and discuss key challenges and promising research directions to guide the community in addressing open problems and advancing the field.

Through these efforts, the paper aims to give the research community a clear overview of the progress made so far, the challenges that remain, and the directions for future research.