Abstract: The intelligent understanding of remote sensing images (RSI) is undergoing a profound paradigm shift driven by multi-modal large language models (MLLMs): from the paradigm of learning a domain model (LaDM) to the paradigm of learning a pre-trained general foundation model followed by an adaptive domain model (LaGD). Under the new LaGD paradigm, the old datasets, which have driven advances in RSI intelligent understanding over the last decade, are no longer suitable for the brand-new tasks. We argue that a new dataset must be designed to support these tasks, with the following features: 1) Generalization: training the model to learn knowledge shared among tasks and to adapt to different tasks; 2) Understanding complex scenes: training the model to understand the fine-grained attributes of the objects of interest and to describe the scene in natural language; 3) Reasoning: training the model to perform high-level visual reasoning. In this paper, we design a high-quality, diversified, and unified multimodal instruction-following dataset for RSI understanding, produced with GPT-4V and existing datasets, which we call RS-GPT4V. To achieve generalization, we use (Question, Answer) pairs, derived from GPT-4V via instruction-following, to unify tasks such as captioning and localization. To handle complex scenes, we propose a hierarchical instruction description with a local strategy, in which the fine-grained attributes of the objects and their spatial relationships are described, and a global strategy, in which all the local information is integrated to yield a detailed instruction description. To achieve reasoning, we design multi-turn QA pairs to provide the model with reasoning ability. The empirical results show that MLLMs fine-tuned on RS-GPT4V can describe fine-grained information.
The dataset is available at: https://github.com/GeoX-Lab/RS-GPT4V
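
As an illustration of how captioning and localization can be unified under a single (Question, Answer) format, the following is a minimal sketch. It is not the actual RS-GPT4V schema: the class names (`QAPair`, `InstructionSample`), the field names, the question templates, and the bounding-box text encoding are all assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class QAPair:
    """One (Question, Answer) turn. Hypothetical structure, not the dataset's real schema."""
    question: str
    answer: str


@dataclass
class InstructionSample:
    """An image plus one or more QA turns; several turns model a multi-turn dialogue."""
    image_id: str
    task: str
    turns: List[QAPair] = field(default_factory=list)


def caption_to_sample(image_id: str, caption: str) -> InstructionSample:
    # Captioning becomes a single QA turn with a fixed (assumed) question template.
    question = "Describe this remote sensing image."
    return InstructionSample(image_id, "captioning", [QAPair(question, caption)])


def box_to_sample(image_id: str, label: str, box: Tuple[int, int, int, int]) -> InstructionSample:
    # Localization becomes a QA turn whose answer is the box written as text.
    # The "[x0, y0, x1, y1]" encoding is an assumption for illustration only.
    x0, y0, x1, y1 = box
    question = f"Where is the {label} in the image?"
    answer = f"[{x0}, {y0}, {x1}, {y1}]"
    return InstructionSample(image_id, "localization", [QAPair(question, answer)])


if __name__ == "__main__":
    print(caption_to_sample("img_001", "An airport with two runways."))
    print(box_to_sample("img_001", "airplane", (10, 20, 110, 220)))
```

Because both tasks end up in the same record type, a multi-turn sample could be built in this sketch by appending further `QAPair` turns to the same `InstructionSample`.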

EarthGPT: A Universal Multi-modal Large Language Model for Multi-sensor Image Comprehension in Remote Sensing Domain


LHRS-Bot-Nova: Improved Multimodal Large Language Model for Remote Sensing Vision-Language Interpretation

From Pixels to Prose: Advancing Multi-Modal Language Models for Remote Sensing

LHRS-Bot: Empowering Remote Sensing with VGI-Enhanced Large Multimodal Language Model

EarthMarker: A Visual Prompting Multi-modal Large Language Model for Remote Sensing

On the Promises and Challenges of Multimodal Foundation Models for Geographical, Environmental, Agricultural, and Urban Planning Applications

SkyEyeGPT: Unifying Remote Sensing Vision-Language Tasks via Instruction Tuning with Large Language Model

RS-GPT4V: A Unified Multimodal Instruction-Following Dataset for Remote Sensing Image Understanding

LiDAR-LLM: Exploring the Potential of Large Language Models for 3D LiDAR Understanding

GroundingGPT: Language Enhanced Multi-modal Grounding Model

GeoGround: A Unified Large Vision-Language Model for Remote Sensing Visual Grounding

VisionGPT: Vision-Language Understanding Agent Using Generalized Multimodal Framework

MR-MLLM: Mutual Reinforcement of Multimodal Comprehension and Vision Perception

RSTeller: Scaling Up Visual Language Modeling in Remote Sensing with Rich Linguistic Semantics from Openly Available Data and Large Language Models

EventGPT: Event Stream Understanding with Multimodal Large Language Models

LLM-Optic: Unveiling the Capabilities of Large Language Models for Universal Visual Grounding

Large Language Models for Captioning and Retrieving Remote Sensing Images

Charting New Territories: Exploring the Geographic and Geospatial Capabilities of Multimodal LLMs

Vision-Language Models in Remote Sensing: Current Progress and Future Trends