Abstract: Large language models (LLMs) have notably accelerated progress towards artificial general intelligence (AGI), with their impressive zero-shot capacity for user-tailored tasks, endowing them with immense potential across a range of applications. However, in the field of computer vision, despite the availability of numerous powerful vision foundation models (VFMs), they are still restricted to tasks in a pre-defined form, struggling to match the open-ended task capabilities of LLMs. In this work, we present an LLM-based framework for vision-centric tasks, termed VisionLLM. This framework provides a unified perspective for vision and language tasks by treating images as a foreign language and aligning vision-centric tasks with language tasks that can be flexibly defined and managed using language instructions. An LLM-based decoder can then make appropriate predictions based on these instructions for open-ended tasks. Extensive experiments show that the proposed VisionLLM can achieve different levels of task customization through language instructions, from fine-grained object-level to coarse-grained task-level customization, all with good results. It's noteworthy that, with a generalist LLM-based framework, our model can achieve over 60% mAP on COCO, on par with detection-specific models. We hope this model can set a new baseline for generalist vision and language models. The demo shall be released based on https://github.com/OpenGVLab/InternGPT. The code shall be released at https://github.com/OpenGVLab/VisionLLM.

LLMFormer: Large Language Model for Open-Vocabulary Semantic Segmentation

LMSeg: Unleashing the Power of Large-Scale Models for Open-Vocabulary Semantic Segmentation

Training-Free Semantic Segmentation via LLM-Supervision

LLM-Seg: Bridging Image Segmentation and Large Language Model Reasoning

VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks

SemiVL: Semi-Supervised Semantic Segmentation with Vision-Language Guidance

Few-Shot Classification & Segmentation Using Large Language Models Agent

Emergent Open-Vocabulary Semantic Segmentation from Off-the-shelf Vision-Language Models

LLaFS: When Large Language Models Meet Few-Shot Segmentation

HyperSeg: Towards Universal Visual Segmentation with Large Language Model

Learning Open-vocabulary Semantic Segmentation Models from Natural Language Supervision

OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding

Leveraging Large Language Models for Scalable Vector Graphics-Driven Image Understanding

Adapting Vision-Language Model with Fine-grained Semantics for Open-Vocabulary Segmentation

RelationVLM: Making Large Vision-Language Models Understand Visual Relations

LION: Empowering Multimodal Large Language Model with Dual-Level Visual Knowledge

InfMLLM: A Unified Framework for Visual-Language Tasks

SAM4MLLM: Enhance Multi-Modal Large Language Model for Referring Expression Segmentation

LLMs4OL: Large Language Models for Ontology Learning

LLM4Brain: Training a Large Language Model for Brain Video Understanding

Visual Perception by Large Language Model's Weights