Abstract: Multimodal large language models (MLLMs) have demonstrated remarkable success in vision and vision-language tasks within the natural image domain. Owing to the significant domain gap between natural and remote sensing (RS) images, the development of MLLMs in the RS domain is still in its infancy. To fill this gap, this article proposes EarthGPT, a pioneering MLLM that uniformly integrates various multisensor RS interpretation tasks for universal RS image comprehension. First, a visual-enhanced perception mechanism is constructed to refine and incorporate coarse-scale semantic perception information and fine-scale detailed perception information. Second, a cross-modal mutual comprehension approach is proposed to enhance the interplay between visual perception and language comprehension and to deepen the comprehension of both visual and language content. Finally, a unified instruction tuning method for multisensor multitasking in the RS domain is proposed to unify a wide range of tasks, including scene classification, image captioning, region-level captioning, visual question answering (VQA), visual grounding, and object detection. More importantly, a large-scale multisensor multimodal RS instruction-following dataset named MMRS-1M is constructed, comprising over 1M image-text pairs based on 34 existing diverse RS datasets and including multisensor images such as optical, synthetic aperture radar (SAR), and infrared. The MMRS-1M dataset addresses MLLMs' lack of RS expert knowledge and stimulates the development of MLLMs in the RS domain. Extensive experiments demonstrate EarthGPT's superior performance in various RS visual interpretation tasks compared with other specialist models and MLLMs, confirming the effectiveness of the proposed EarthGPT and offering a versatile paradigm for open-set reasoning tasks. Our code and dataset are available at https://github.com/wivizhang/EarthGPT.

Charting New Territories: Exploring the Geographic and Geospatial Capabilities of Multimodal LLMs

On the Promises and Challenges of Multimodal Foundation Models for Geographical, Environmental, Agricultural, and Urban Planning Applications

GeoLLM: Extracting Geospatial Knowledge from Large Language Models

LLMGeo: Benchmarking Large Language Models on Image Geolocation In-the-wild

Visualization Literacy of Multimodal Large Language Models: A Comparative Study

GPT4GEO: How a Language Model Sees the World's Geography

A Survey on Multimodal Large Language Models

Are Large Language Models Geospatially Knowledgeable?

G-LLaVA: Solving Geometric Problem with Multi-Modal Large Language Model

An Empirical Analysis on Spatial Reasoning Capabilities of Large Multimodal Models

LHRS-Bot: Empowering Remote Sensing with VGI-Enhanced Large Multimodal Language Model

EarthGPT: A Universal Multimodal Large Language Model for Multisensor Image Comprehension in Remote Sensing Domain

Improving Multimodal LLMs Ability In Geometry Problem Solving, Reasoning, And Multistep Scoring

GePBench: Evaluating Fundamental Geometric Perception for Multimodal Large Language Models

Exploring and Improving the Spatial Reasoning Abilities of Large Language Models

MMT-Bench: A Comprehensive Multimodal Benchmark for Evaluating Large Vision-Language Models Towards Multitask AGI

LLMMaps -- A Visual Metaphor for Stratified Evaluation of Large Language Models

Tell Me Where You Are: Multimodal LLMs Meet Place Recognition

The Implementation of Multimodal Large Language Models for Hydrological Applications: A Comparative Study of GPT-4 Vision, Gemini, LLaVa, and Multimodal-GPT

Enhanced Multimodal RAG-LLM for Accurate Visual Question Answering

Enhancing the Spatial Awareness Capability of Multi-Modal Large Language Model