VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks

Wenhai Wang, Zhe Chen, Xiaokang Chen, Jiannan Wu, Xizhou Zhu, Gang Zeng, Ping Luo, Tong Lu, Jie Zhou, Yu Qiao, Jifeng Dai
2023-05-25
Abstract: Large language models (LLMs) have notably accelerated progress towards artificial general intelligence (AGI), with their impressive zero-shot capacity for user-tailored tasks, endowing them with immense potential across a range of applications. However, in the field of computer vision, despite the availability of numerous powerful vision foundation models (VFMs), they are still restricted to tasks in a pre-defined form, struggling to match the open-ended task capabilities of LLMs. In this work, we present an LLM-based framework for vision-centric tasks, termed VisionLLM. This framework provides a unified perspective for vision and language tasks by treating images as a foreign language and aligning vision-centric tasks with language tasks that can be flexibly defined and managed using language instructions. An LLM-based decoder can then make appropriate predictions based on these instructions for open-ended tasks. Extensive experiments show that the proposed VisionLLM can achieve different levels of task customization through language instructions, from fine-grained object-level to coarse-grained task-level customization, all with good results. It's noteworthy that, with a generalist LLM-based framework, our model can achieve over 60% mAP on COCO, on par with detection-specific models. We hope this model can set a new baseline for generalist vision and language models. The demo shall be released based on https://github.com/OpenGVLab/InternGPT. The code shall be released at https://github.com/OpenGVLab/VisionLLM.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper proposes VisionLLM, a framework that closes the gap between vision foundation models (VFMs), which are restricted to tasks in a pre-defined form, and large language models (LLMs), which handle open-ended, user-defined tasks. Specifically, it targets three issues:

1. **Unifying Vision and Language Tasks**: Current VFMs are powerful, but each is tied to a predefined task format, so they lack the flexibility LLMs show. The researchers therefore aim for a single framework in which vision tasks are expressed the same way as language tasks.
2. **Open-ended Task Capability**: Because existing vision models accept only predefined task formats, they cannot take on a task the user specifies at inference time. The proposed method defines and manages vision tasks through language instructions instead.
3. **Inconsistency of Visual Prompts**: Visual prompt tuning can flexibly define some purely visual tasks, but the format of these prompts differs from the instruction format used by language models. This mismatch prevents the reasoning capabilities and world knowledge of LLMs from being applied directly to vision tasks.

In summary, the paper proposes an LLM-based framework, VisionLLM, for vision-centric tasks, and addresses the above issues with three components:

- **Unified Language Instruction Design**: A single instruction format covers both vision and vision-language tasks, so users define a task with a plain language description, which makes tasks more flexible and diverse.
- **Language-Guided Image Encoder**: To understand image content and align it with the language instruction, the paper develops an image encoder that encodes image information conditioned on the given instruction.
- **LLM-based Open-ended Task Decoder**: An LLM-based decoder generates the prediction that the language instruction asks for, which lets it handle a variety of vision-centric tasks.

In this way, VisionLLM handles traditional vision tasks such as object detection and instance segmentation as well as vision-language tasks such as image captioning and visual question answering. It also supports customization at different levels, from fine-grained object-level to coarse-grained task-level. The reported results, including over 60% mAP on COCO, on par with detection-specific models, support its effectiveness and generality.
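The idea of defining a vision task through a language instruction and reading the prediction back as text can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the instruction wording, the `<image>` placeholder, the `(c, x1, y1, x2, y2)` output tuple, the coordinate range, and the helper names `build_instruction` and `parse_detections`. No model is called; the sketch only shows how object-level customization (the class set) and task-level customization (the task name) could both be carried by the instruction text.

```python
import re
from dataclasses import dataclass


@dataclass
class Detection:
    label: str
    box: tuple  # (x1, y1, x2, y2) as integers


def build_instruction(task, classes=(), coord_range=(0, 1000)):
    """Compose a hypothetical language instruction for a vision task.

    The wording and output format are illustrative assumptions,
    not the prompts used by VisionLLM.
    """
    lo, hi = coord_range
    if task == "detection":
        class_list = ", ".join(classes)
        return (
            f"Identify every object in <image> that belongs to the set "
            f"{{{class_list}}}. Report each as (c, x1, y1, x2, y2) with "
            f"integer coordinates in [{lo}, {hi}]."
        )
    if task == "captioning":
        return "Describe <image> in one sentence."
    raise ValueError(f"unsupported task: {task}")


# Matches one "(label, x1, y1, x2, y2)" tuple in free-form decoder text.
_TUPLE = re.compile(
    r"\(\s*([\w ]+?)\s*,\s*(-?\d+)\s*,\s*(-?\d+)\s*,\s*(-?\d+)\s*,\s*(-?\d+)\s*\)"
)


def parse_detections(text, classes):
    """Parse decoder text into Detection records, keeping requested classes only."""
    allowed = set(classes)
    results = []
    for match in _TUPLE.finditer(text):
        label = match.group(1)
        if label in allowed:
            box = tuple(int(v) for v in match.groups()[1:])
            results.append(Detection(label, box))
    return results


if __name__ == "__main__":
    wanted = ["person", "dog"]
    print(build_instruction("detection", wanted))
    # A made-up decoder response, used only to exercise the parser.
    fake_output = "(person, 10, 20, 110, 220) (cat, 1, 2, 3, 4) (dog, 5, 5, 50, 60)"
    for det in parse_detections(fake_output, wanted):
        print(det)
```

Changing the `classes` argument changes which objects are requested without touching any model code, and changing `task` switches the whole output format, which is the sense in which a text instruction can carry both levels of customization.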