Abstract: Automatically and rapidly understanding Earth's surface is fundamental to our grasp of the living environment and to informed decision-making. This underscores the need for a unified system with comprehensive capabilities for analyzing Earth's surface to address a wide range of human needs. The emergence of multimodal large language models (MLLMs) has great potential to boost the efficiency and convenience of intelligent Earth observation. These models can engage in human-like conversations, serve as unified platforms for understanding images, follow diverse instructions, and provide insightful feedback. In this study, we introduce LHRS-Bot-Nova, an MLLM specialized in understanding remote sensing (RS) images, designed to expertly perform a wide range of RS understanding tasks aligned with human instructions. LHRS-Bot-Nova features an enhanced vision encoder and a novel bridge layer, enabling efficient visual compression and better language-vision alignment. To further enhance RS-oriented vision-language alignment, we propose a large-scale RS image-caption dataset, generated through feature-guided image recaptioning. Additionally, we introduce an instruction dataset specifically designed to improve spatial recognition abilities. Extensive experiments demonstrate the superior performance of LHRS-Bot-Nova across various RS image understanding tasks. We also evaluate the performance of different MLLMs on complex RS perception and instruction following using a complicated multi-choice question evaluation benchmark, providing a reliable guide for future model selection and improvement. Data, code, and models will be available at https://github.com/NJU-LHRS/LHRS-Bot.

What problem does this paper attempt to address?

The paper attempts to solve three main problems:

1. **Lack of high-quality, large-scale image-caption datasets**: The datasets currently used for vision-language pre-training of remote-sensing multimodal large language models (MLLMs) are often noisy, information-poor, semantically weak, low in sentence diversity, and overly focused on salient objects. These problems limit how effectively a model can align the two modalities.
2. **Weak spatial recognition ability and hallucination tendency**: Existing remote-sensing MLLMs have low accuracy in spatial positioning and are prone to hallucinated responses, that is, to giving incorrect answers when faced with questions beyond their capabilities.
3. **Challenges in comprehensively evaluating MLLMs**: Although these models perform well on common tasks such as classification, visual question answering, and visual localization, the existing evaluation metrics cannot comprehensively reflect their abilities in complex scene understanding, object attribute recognition, spatial relationship recognition, and following human instructions.

To address these problems, the paper proposes LHRS-Bot-Nova, an improved multimodal large language model built specifically for understanding and interpreting remote-sensing images. LHRS-Bot-Nova tackles them in the following ways:

- **Constructing a high-quality, large-scale image-caption dataset**: The paper proposes a feature-guided image recaptioning method and generates a new dataset named LHRS-Align-Recap, which has a richer vocabulary, more diverse sentence structures, and stronger vision-language alignment quality.
- **Enhancing spatial recognition ability and reducing hallucinations**: The paper expands the LHRS-Instruct dataset with more conversations about localization and perception, and introduces an off-the-shelf robust visual instruction dataset containing a large number of negative samples, which balances the data and reduces the likelihood that the model hallucinates.
- **Optimizing the model architecture**: The paper designs an enhanced visual encoder and a new visual perceptron based on the mixture-of-experts (MoE) structure, which improves the model's vision-language alignment performance and visual understanding ability.
- **Systematic evaluation**: The paper not only evaluates LHRS-Bot-Nova on standard remote-sensing tasks but also uses a multiple-choice question evaluation benchmark (LHRS-Bench) to assess the model's instruction-following ability and other remote-sensing-specific abilities along multiple dimensions.

Through these improvements, LHRS-Bot-Nova shows excellent performance on various remote-sensing image understanding tasks, with particularly significant progress in spatial recognition and in reducing hallucinations.
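To make the idea of a multi-dimensional multiple-choice evaluation more concrete, here is a minimal Python sketch of per-dimension accuracy scoring. It is not the LHRS-Bench implementation: the record keys (`dimension`, `answer`, `prediction`), the function names, and the rule for extracting an option letter from a free-text response are all assumptions made for illustration.

```python
import re
from collections import defaultdict


def extract_choice(response, options="ABCD"):
    """Return the first standalone uppercase option letter in a response, or None.

    Assumed heuristic: a sentence that starts with the word "A" would be
    misread as option A, so a real benchmark would need a stricter rule.
    """
    match = re.search(rf"\b([{options}])\b", response)
    return match.group(1) if match else None


def accuracy_by_dimension(records):
    """Compute accuracy per evaluation dimension.

    Each record is a dict with hypothetical keys:
      'dimension'  - the ability being probed (e.g. 'spatial')
      'answer'     - the ground-truth option letter
      'prediction' - the raw text returned by the model
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for rec in records:
        total[rec["dimension"]] += 1
        if extract_choice(rec["prediction"]) == rec["answer"]:
            correct[rec["dimension"]] += 1
    return {dim: correct[dim] / total[dim] for dim in total}


# Made-up records, used only to exercise the functions.
records = [
    {"dimension": "spatial", "answer": "B", "prediction": "B"},
    {"dimension": "spatial", "answer": "A", "prediction": "The answer is (C)."},
    {"dimension": "attribute", "answer": "D", "prediction": "D. The roof is red."},
]
scores = accuracy_by_dimension(records)
print(scores)
```

Reporting one score per dimension, rather than a single overall accuracy, is what lets such a benchmark show where a model is weak, for example in spatial relationships versus attribute recognition.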

LHRS-Bot-Nova: Improved Multimodal Large Language Model for Remote Sensing Vision-Language Interpretation
