GPT4RoI: Instruction Tuning Large Language Model on Region-of-Interest

Shilong Zhang, Peize Sun, Shoufa Chen, Min Xiao, Wenqi Shao, Wenwei Zhang, Yu Liu, Kai Chen, Ping Luo

2024-06-01

Abstract: Visual instruction tuning of large language models (LLMs) on image-text pairs has achieved general-purpose vision-language abilities. However, the lack of region-text pairs limits their advancement toward fine-grained multimodal understanding. In this paper, we propose spatial instruction tuning, which introduces references to regions-of-interest (RoIs) in the instruction. Before being sent to the LLM, each reference is replaced by RoI features and interleaved with language embeddings as a sequence. Our model GPT4RoI, trained on 7 region-text pair datasets, brings an unprecedented interactive and conversational experience compared to previous image-level models. (1) Interaction beyond language: Users can interact with our model through both language and drawn bounding boxes to flexibly adjust the referring granularity. (2) Versatile multimodal abilities: A variety of attribute information within each RoI can be mined by GPT4RoI, e.g., color, shape, material, action, etc. Furthermore, it can reason about multiple RoIs based on common sense. On the Visual Commonsense Reasoning (VCR) dataset, GPT4RoI achieves a remarkable accuracy of 81.6%, surpassing all existing models by a significant margin (the second place is 75.6%) and almost reaching human-level performance of 85.0%. The code, dataset, and demo can be found at https://github.com/jshilong/GPT4RoI.

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

The paper addresses the lack of fine-grained, region-level understanding in vision-language models. Existing large language models (LLMs) that are instruction-tuned on image-text pairs perform well on image-level multimodal tasks, but the scarcity of region-text pairs limits them on fine-grained tasks such as region description and region-level reasoning. To close this gap, the paper proposes spatial instruction tuning: the instruction contains references to Regions-of-Interest (RoIs), and before the sequence is sent to the LLM each reference is replaced by the corresponding RoI features, which are interleaved with the language embeddings. Users can therefore interact with the model not only through language at the image level but also by drawing bounding boxes, which lets them flexibly adjust the granularity of what they refer to. The resulting model, GPT4RoI, can extract a variety of attributes from each RoI (color, shape, material, action, etc.) and reason about multiple RoIs using common sense. On the Visual Commonsense Reasoning (VCR) dataset it reaches 81.6% accuracy, compared with 75.6% for the second-best model and 85.0% for human-level performance.
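The interleaving step described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `<regionN>` placeholder syntax, the `embed_token` and `interleave` helpers, the toy 4-dimensional embeddings, and the precomputed RoI feature vectors are all assumptions made for illustration. The sketch only shows how region references in an instruction could be swapped for RoI features while the remaining words are embedded as text.

```python
import re
from typing import Dict, List, Sequence

EMBED_DIM = 4  # toy width chosen for illustration only
REGION_PATTERN = re.compile(r"(<region\d+>)")  # assumed placeholder syntax


def embed_token(token: str) -> List[float]:
    """Hypothetical stand-in for an LLM embedding lookup (deterministic toy vector)."""
    code = sum(ord(ch) for ch in token)
    return [float((code + i) % 7) for i in range(EMBED_DIM)]


def interleave(
    instruction: str, roi_features: Dict[str, Sequence[float]]
) -> List[List[float]]:
    """Build one sequence in which each region reference is replaced by its
    RoI feature vector and every other word is replaced by a text embedding."""
    sequence: List[List[float]] = []
    # Splitting on a capturing group keeps the placeholders in the result.
    for piece in REGION_PATTERN.split(instruction):
        if REGION_PATTERN.fullmatch(piece):
            # Region reference: use the (assumed precomputed) RoI feature.
            sequence.append(list(roi_features[piece]))
        else:
            # Plain text: embed each whitespace-separated word.
            sequence.extend(embed_token(word) for word in piece.split())
    return sequence


# Usage: two user-drawn boxes, each with an assumed feature vector.
features = {"<region1>": [1.0] * EMBED_DIM, "<region2>": [2.0] * EMBED_DIM}
mixed = interleave("What is <region1> doing next to <region2>?", features)
print(len(mixed))  # number of positions in the interleaved sequence
```

In this sketch the RoI features share the width of the text embeddings so that both kinds of vectors can sit in the same sequence; how the real model extracts and projects region features is not covered here.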

Position-Enhanced Visual Instruction Tuning for Multimodal Large Language Models

RegionGPT: Towards Region Understanding Vision Language Model

Aligning Large Multi-Modal Model with Robust Instruction Tuning

SkyEyeGPT: Unifying Remote Sensing Vision-Language Tasks via Instruction Tuning with Large Language Model

LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding

InstructionGPT-4: A 200-Instruction Paradigm for Fine-Tuning MiniGPT-4

SpatialRGPT: Grounded Spatial Reasoning in Vision Language Models

InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning

StableLLaVA: Enhanced Visual Instruction Tuning with Synthesized Image-Dialogue Data

RS-GPT4V: A Unified Multimodal Instruction-Following Dataset for Remote Sensing Image Understanding

SVIT: Scaling up Visual Instruction Tuning

VIGC: Visual Instruction Generation and Correction

Personalized Visual Instruction Tuning

Vision-Language Instruction Tuning: A Review and Analysis

Vision-Flan: Scaling Human-Labeled Tasks in Visual Instruction Tuning

MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models

Reconstructive Visual Instruction Tuning

GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction

Exploring Visual Culture Awareness in GPT-4V: A Comprehensive Probing

SkySenseGPT: A Fine-Grained Instruction Tuning Dataset and Model for Remote Sensing Vision-Language Understanding