LHRS-Bot: Empowering Remote Sensing with VGI-Enhanced Large Multimodal Language Model

Dilxat Muhtar, Zhenshi Li, Feng Gu, Xueliang Zhang, Pengfeng Xiao
2024-07-16
Abstract: The revolutionary capabilities of large language models (LLMs) have paved the way for multimodal large language models (MLLMs) and fostered diverse applications across various specialized domains. In the remote sensing (RS) field, however, the diverse geographical landscapes and varied objects in RS imagery are not adequately considered in recent MLLM endeavors. To bridge this gap, we construct a large-scale RS image-text dataset, LHRS-Align, and an informative RS-specific instruction dataset, LHRS-Instruct, leveraging the extensive volunteered geographic information (VGI) and globally available RS images. Building on this foundation, we introduce LHRS-Bot, an MLLM tailored for RS image understanding through a novel multi-level vision-language alignment strategy and a curriculum learning method. Additionally, we introduce LHRS-Bench, a benchmark for thoroughly evaluating MLLMs' abilities in RS image understanding. Comprehensive experiments demonstrate that LHRS-Bot exhibits a profound understanding of RS images and the ability to perform nuanced reasoning within the RS domain.
Computer Vision and Pattern Recognition, Artificial Intelligence, Machine Learning
What problem does this paper attempt to address?
The paper addresses a gap in remote sensing (RS) image understanding: existing multimodal large language models (MLLMs) do not adequately account for the diverse geographical landscapes and varied objects in RS imagery, which limits their performance on RS image understanding tasks. Specifically, the paper points out:

1. **Complexity of RS images**: RS images contain complex geographical landscapes and many types of objects whose appearance varies greatly across visual scales, which makes holistic image understanding difficult.
2. **Limitations of existing datasets**: Existing public RS datasets do not fully exploit RS features on a global scale, so they cannot comprehensively inject RS visual knowledge into LLMs.
3. **Insufficient vision-language alignment**: Current methods focus mainly on high-level visual semantics and overlook the role that different levels of visual information play in achieving comprehensive alignment between vision and language.

To solve these problems, the paper makes the following main contributions:

1. **Constructing a large-scale RS image-text dataset**: The authors construct a large-scale RS image-text dataset named LHRS-Align. High-quality image-text pairs are generated by pairing RS images with geographical information from the OpenStreetMap (OSM) database.
2. **Creating an RS-specific instruction dataset**: The authors also create LHRS-Instruct, a multimodal instruction-following dataset for RS image understanding tasks that contains complex visual reasoning data.
3. **Proposing an RS-specific MLLM**: Based on these datasets, the authors propose LHRS-Bot, an RS-specific MLLM that adopts a novel multi-level vision-language alignment strategy and a curriculum learning method to improve RS image understanding.
4. **Establishing an evaluation benchmark in the RS field**: The authors construct LHRS-Bench, a benchmark for comprehensively evaluating MLLMs in the RS field that covers multiple evaluation dimensions and sub-categories.

Through these contributions, the paper aims to improve the performance of MLLMs on RS image understanding tasks, especially in detecting complex objects, holding conversations with humans, and extracting insights from RS images.
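The idea of pairing RS images with OSM information can be illustrated with a minimal sketch. The summary does not describe how the authors perform this pairing, so everything below is an assumption made for illustration: the `OSMFeature` class, the bounding-box filter, the `key=value` text format, and the sample coordinates are all hypothetical and are not the LHRS-Align pipeline.

```python
from dataclasses import dataclass


@dataclass
class OSMFeature:
    """Hypothetical, simplified stand-in for an OSM element: a point with tags."""
    lon: float
    lat: float
    tags: dict


def features_in_bbox(features, bbox):
    """Keep features whose point lies inside bbox = (min_lon, min_lat, max_lon, max_lat)."""
    min_lon, min_lat, max_lon, max_lat = bbox
    return [
        f for f in features
        if min_lon <= f.lon <= max_lon and min_lat <= f.lat <= max_lat
    ]


def tags_to_text(features):
    """Flatten the tags of each feature into a plain 'key=value' description."""
    phrases = []
    for f in features:
        phrase = ", ".join(f"{k}={v}" for k, v in sorted(f.tags.items()))
        if phrase:  # skip features that carry no tags
            phrases.append(phrase)
    return "; ".join(phrases)


def build_pair(image_id, bbox, features):
    """Pair one image footprint with the text derived from features inside it."""
    inside = features_in_bbox(features, bbox)
    return {"image": image_id, "text": tags_to_text(inside)}


if __name__ == "__main__":
    # Illustrative data only; coordinates and tags are made up.
    sample = [
        OSMFeature(118.78, 32.05, {"leisure": "park"}),
        OSMFeature(118.79, 32.06, {"building": "school", "name": "X"}),
        OSMFeature(120.00, 30.00, {"natural": "water"}),  # outside the footprint
    ]
    print(build_pair("tile_0001", (118.7, 32.0, 118.8, 32.1), sample))
```

A real pipeline would likely work with polygon geometries rather than points and would turn the selected tags into natural-language captions. The sketch only shows the core step of matching an image footprint to the geographic attributes that fall within it.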