Abstract: Large language models (LLMs) have notably accelerated progress towards artificial general intelligence (AGI), with their impressive zero-shot capacity for user-tailored tasks, endowing them with immense potential across a range of applications. However, in the field of computer vision, despite the availability of numerous powerful vision foundation models (VFMs), they are still restricted to tasks in a pre-defined form, struggling to match the open-ended task capabilities of LLMs. In this work, we present an LLM-based framework for vision-centric tasks, termed VisionLLM. This framework provides a unified perspective for vision and language tasks by treating images as a foreign language and aligning vision-centric tasks with language tasks that can be flexibly defined and managed using language instructions. An LLM-based decoder can then make appropriate predictions based on these instructions for open-ended tasks. Extensive experiments show that the proposed VisionLLM can achieve different levels of task customization through language instructions, from fine-grained object-level to coarse-grained task-level customization, all with good results. It's noteworthy that, with a generalist LLM-based framework, our model can achieve over 60% mAP on COCO, on par with detection-specific models. We hope this model can set a new baseline for generalist vision and language models. The demo shall be released based on https://github.com/OpenGVLab/InternGPT. The code shall be released at https://github.com/OpenGVLab/VisionLLM.

What problem does this paper attempt to address?

The main goal of this paper is to propose a new framework, VisionLLM, that lets vision-centric tasks be handled with the open-ended flexibility of large language models (LLMs). Specifically, the paper aims to address the following key issues:

1. **Unifying Vision and Language Tasks**: Although current vision foundation models (VFMs) are powerful, they are tied to predefined task formats and therefore cannot match the flexibility that LLMs show on open-ended tasks. The researchers aim to establish a framework that unifies vision tasks with language tasks.
2. **Open-ended Task Capability**: Because existing vision models are constrained by predefined task formats, a new task usually cannot be specified without changing the model. The proposed method instead uses language instructions to define and manage vision tasks.
3. **Inconsistency of Visual Prompts**: Visual prompt tuning methods can flexibly define some purely visual tasks, but the format of these prompts is inconsistent with the instruction format used by language models. This inconsistency prevents the reasoning capabilities and world knowledge of LLMs from being applied directly to vision tasks.

In summary, the paper proposes an LLM-based framework, VisionLLM, for handling vision-centric tasks, and addresses the above issues with three components:

- **Unified Language Instruction Design**: A single instruction design that applies to both vision and vision-language tasks, so that users can define tasks through simple language descriptions, making tasks more flexible and diverse.
- **Language-Guided Image Encoder**: An image encoder that encodes image information conditioned on the given instruction, so that the visual features align with the language instruction.
- **LLM-based Open-ended Task Decoder**: A decoder that generates prediction results based on the given language instruction, handling a variety of vision-centric tasks.

In this way, VisionLLM can handle not only traditional vision tasks such as object detection and instance segmentation but also more complex vision-language tasks such as image description and visual question answering. It can also customize tasks at different levels, from fine-grained object-level to coarse-grained task-level. Results on benchmark datasets, including over 60% mAP on COCO (on par with detection-specific models), support its effectiveness and generality.
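To make the idea of defining tasks through language instructions more concrete, the following is a minimal sketch, not the paper's actual implementation. The instruction wording, the `<image>` placeholder, the tuple output format, and the names `build_instruction` and `parse_boxes` are all assumptions introduced here for illustration. It shows how a task-level choice (detection versus description) and an object-level choice (which classes to look for) could both be expressed as text, and how a textual answer could be parsed back into structured boxes.

```python
import re
from dataclasses import dataclass
from typing import List, Optional, Sequence

IMAGE_TOKEN = "<image>"  # hypothetical placeholder for the image input


@dataclass
class Box:
    label: str
    x1: float
    y1: float
    x2: float
    y2: float


def build_instruction(task: str, classes: Optional[Sequence[str]] = None) -> str:
    """Express a vision task as a plain-language instruction (illustrative wording)."""
    if task == "detection":
        if not classes:
            raise ValueError("detection needs at least one class name")
        names = ", ".join(classes)
        return (
            f"Find every object in {IMAGE_TOKEN} that belongs to one of: {names}. "
            "Answer with tuples of the form (label, x1, y1, x2, y2)."
        )
    if task == "caption":
        return f"Describe {IMAGE_TOKEN} in one sentence."
    raise ValueError(f"unknown task: {task}")


# One number: optional minus sign, digits, optional decimal part.
_NUM = r"(-?\d+(?:\.\d+)?)"
# One tuple: "(label, x1, y1, x2, y2)" with flexible whitespace.
_TUPLE = re.compile(
    r"\(\s*([^,()]+?)\s*," + ",".join([r"\s*" + _NUM + r"\s*"] * 4) + r"\)"
)


def parse_boxes(text: str) -> List[Box]:
    """Parse the assumed textual output format back into Box records."""
    return [
        Box(m[1], float(m[2]), float(m[3]), float(m[4]), float(m[5]))
        for m in _TUPLE.finditer(text)
    ]


if __name__ == "__main__":
    # Object-level customization: change the class list only.
    print(build_instruction("detection", ["person", "dog"]))
    # Task-level customization: change the task itself.
    print(build_instruction("caption"))
    # A made-up model answer, used only to exercise the parser.
    print(parse_boxes("(person, 10, 20, 110, 220) (dog, 5.5, 6, 50, 60)"))
```

In this sketch the task definition lives entirely in the instruction string, so switching tasks or class sets requires no change to the surrounding code; that is the property the summary above attributes to instruction-defined vision tasks.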

VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks

VisionLLM v2: An End-to-End Generalist Multimodal Large Language Model for Hundreds of Vision-Language Tasks

VisionGPT: Vision-Language Understanding Agent Using Generalized Multimodal Framework

CoVLM: Composing Visual Entities and Relationships in Large Language Models Via Communicative Decoding

InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks

CogVLM: Visual Expert for Large Language Models

Jack of All Tasks, Master of Many: Designing General-purpose Coarse-to-Fine Vision-Language Model

LM4LV: A Frozen Large Language Model for Low-level Vision Tasks

InfMLLM: A Unified Framework for Visual-Language Tasks

An Introduction to Vision-Language Modeling

GLOV: Guided Large Language Models as Implicit Optimizers for Vision Language Models

Vision-Language Intelligence: Tasks, Representation Learning, and Large Models

CogVLM: Visual Expert for Pretrained Language Models

IdealGPT: Iteratively Decomposing Vision and Language Reasoning via Large Language Models

Effectiveness Assessment of Recent Large Vision-Language Models

Exploring the Frontier of Vision-Language Models: A Survey of Current Methodologies and Future Directions

LVLM-eHub: A Comprehensive Evaluation Benchmark for Large Vision-Language Models

Audio-Visual LLM for Video Understanding

VoCo-LLaMA: Towards Vision Compression with Large Language Models

Beyond Text: Frozen Large Language Models in Visual Signal Comprehension