A Survey on Vision Autoregressive Model

Kai Jiang, Jiaxing Huang

2024-11-13

Abstract: Autoregressive models have demonstrated great performance in natural language processing (NLP) with impressive scalability, adaptability and generalizability. Inspired by their notable success in NLP, autoregressive models have recently been intensively investigated for computer vision. These models perform next-token prediction by representing visual data as visual tokens, which enables autoregressive modelling for a wide range of vision tasks, ranging from visual generation and visual understanding to the very recent multimodal generation that unifies visual generation and understanding with a single autoregressive model. This paper provides a systematic review of vision autoregressive models. It develops a taxonomy of existing methods and highlights their major contributions, strengths, and limitations, covering various vision tasks such as image generation, video generation, image editing, motion generation, medical image analysis, 3D generation, robotic manipulation, and unified multimodal generation. In addition, we investigate and analyze the latest advancements in autoregressive models, including thorough benchmarking and discussion of existing methods across various evaluation datasets. Finally, we outline key challenges and promising directions for future research, offering a roadmap to guide further advancements in vision autoregressive models.
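
The next-token formulation mentioned in the abstract can be written as the standard autoregressive factorization. This is a minimal sketch, assuming an image (or other visual input) has already been mapped to a sequence of T discrete visual tokens; the symbols x_t, T, and θ are notation chosen here for illustration and are not taken from the paper.

```latex
% Chain-rule factorization over a sequence of visual tokens x_1, ..., x_T
p_\theta(x_1, \dots, x_T) = \prod_{t=1}^{T} p_\theta\left(x_t \mid x_1, \dots, x_{t-1}\right)

% Corresponding training objective: negative log-likelihood of each next token
\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_\theta\left(x_t \mid x_{<t}\right)
```

Under this view, generation amounts to sampling tokens one at a time from the conditional distributions and decoding the resulting sequence back into pixels.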

Computer Vision and Pattern Recognition, Artificial Intelligence

What problem does this paper attempt to address?

The problem this paper attempts to address is: Autoregressive (AR) models have achieved significant success in the field of Natural Language Processing (NLP), demonstrating excellent scalability, adaptability, and generalization capabilities. Inspired by this, researchers have recently begun applying autoregressive models to computer vision tasks, such as image generation, video generation, image editing, motion generation, medical image analysis, 3D generation, robotic manipulation, and multimodal generation. However, there is currently a lack of a systematic review of these visual autoregressive models, making it difficult to fully understand existing methods, challenges, and future research directions. Therefore, this paper aims to fill this gap by providing a systematic review in the following three aspects:

1. **Systematic Review**: Provide an overview of the applications of visual autoregressive models in various tasks, including image generation, image understanding, etc., and develop a classification system for existing methods, highlighting their main contributions, advantages, and limitations.
2. **Analysis of Recent Advances**: Conduct an in-depth study and analysis of the latest advances in autoregressive models, including benchmarking and discussion on various evaluation datasets.
3. **Future Research Directions**: Identify and discuss several challenges and promising research directions in current research to guide the community in addressing open problems and advancing the field further.

Through these efforts, the paper hopes to provide the research community with a clear overview, showcasing the achievements made, the challenges faced, and the future research directions.

Autoregressive Models in Vision: A Survey

Towards Unifying Understanding and Generation in the Era of Vision Foundation Models: A Survey from the Autoregression Perspective

Data-efficient Large Vision Models through Sequential Autoregression

Exploring Stochastic Autoregressive Image Modeling for Visual Representation

A survey of generative models used in text-to-image

Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction

Vision Language Models in Autonomous Driving: A Survey and Outlook

A Survey on Non-Autoregressive Generation for Neural Machine Translation and Beyond

VQA and Visual Reasoning: An Overview of Recent Datasets, Methods and Challenges

Multi-modal Auto-regressive Modeling via Visual Words

Revolutionizing Text-to-Image Retrieval as Autoregressive Token-to-Voken Generation

Vision-Language Models for Vision Tasks: A Survey

Remote Sensing Temporal Vision-Language Models: A Comprehensive Survey

A Survey on Visual Transformer

CAR: Controllable Autoregressive Modeling for Visual Generation

Vision-Language Models in Remote Sensing: Current progress and future trends

A Survey on Vision Transformer

A Survey on Vision Mamba: Models, Applications and Challenges

Generative AI in Vision: A Survey on Models, Metrics and Applications