MGIMM: Multi-Granularity Instruction Multimodal Model for Attribute-Guided Remote Sensing Image Detailed Description

Cong Yang, Zuchao Li, Lefei Zhang

2024-06-07

Abstract: Recently, large multimodal models have built a bridge from visual to textual information, but they tend to underperform in remote sensing scenarios. This underperformance stems from the complex distribution of objects and the significant scale differences among targets in remote sensing images, which lead to visual ambiguities and insufficient descriptions. Moreover, the lack of multimodal fine-tuning data specific to the remote sensing field makes it difficult to align the model's behavior with user queries. To address these issues, this paper proposes an attribute-guided **Multi-Granularity Instruction Multimodal Model (MGIMM)** for remote sensing image detailed description. MGIMM guides the multimodal model to learn the consistency between visual regions and their corresponding text attributes (such as object names, colors, and shapes) through region-level instruction tuning. Then, with the multimodal model aligned on region-attribute pairs and guided by multi-granularity visual features, MGIMM perceives both region-level and global image information and uses a large language model to produce comprehensive descriptions of remote sensing images. Because no standard benchmark exists for generating detailed descriptions of remote sensing images, we construct a dataset of 38,320 region-attribute pairs and 23,463 image-detailed description pairs. Comparisons with various advanced methods on this dataset demonstrate the effectiveness of MGIMM's region-attribute guided learning approach. Code is available at https://github.com/yangcong356/MGIMM.git

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

The paper aims to address the issues present in the detailed description of remote sensing images, particularly focusing on the limitations of current multimodal models in handling such images. The main issues include:

1. **Visual Ambiguity**: Due to the complex distribution of objects and significant differences in target sizes in remote sensing images, existing models find it difficult to accurately identify and describe the geographical features in the images.
2. **Insufficient Descriptions**: Existing multimodal models often generate short and single-sentence descriptions, failing to capture the detailed geographical information in the images.
3. **Lack of Domain Knowledge**: Most large multimodal models lack expertise specific to the remote sensing field, which limits their ability to accurately understand and describe the images.

To address the above challenges, the paper proposes an attribute-guided multi-granularity instruction multimodal model (MGIMM), which solves these issues through the following methods:

- **Region-Level Instruction Tuning**: Utilizing bounding boxes to guide the model in aligning geographical targets with their corresponding attribute descriptions, addressing the visual ambiguity caused by differences in target sizes.
- **Image-Level Instruction Tuning**: Further training the model to understand the content of the entire image and generate detailed descriptions, fully leveraging the capability of large language models to generate long texts.

Additionally, the paper constructs a new dataset DIOR-IDD, containing 38,320 region-attribute pairs and 23,463 image-detailed description pairs, for model training and evaluation. Experimental results show that MGIMM can effectively improve the quality and accuracy of remote sensing image descriptions.
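The two tuning stages described above can be illustrated with a minimal Python sketch of how region-attribute pairs and image-description pairs might be turned into instruction samples. This is not the authors' implementation: the record types, field names, prompt wording, the `<image>` placeholder, and the choice to serialize bounding boxes as normalized text coordinates are all assumptions made for illustration, and the actual DIOR-IDD schema and MGIMM prompt format may differ.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Hypothetical record types; field names are illustrative, not the DIOR-IDD schema.
@dataclass
class RegionAttributePair:
    image_id: str
    bbox: Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max) in pixels
    attributes: Dict[str, str]       # e.g. object name, color, shape


@dataclass
class ImageDescriptionPair:
    image_id: str
    description: str


def normalize_bbox(bbox, width, height):
    """Scale pixel coordinates to [0, 1] so prompts do not depend on image size."""
    x0, y0, x1, y1 = bbox
    return (round(x0 / width, 3), round(y0 / height, 3),
            round(x1 / width, 3), round(y1 / height, 3))


def region_sample(pair: RegionAttributePair, width: int, height: int) -> Dict[str, str]:
    """Stage 1 (assumed format): ask for the attributes of one boxed region."""
    box = list(normalize_bbox(pair.bbox, width, height))
    prompt = f"<image> Describe the attributes of the object in region {box}."
    response = ", ".join(f"{key}: {value}" for key, value in pair.attributes.items())
    return {"stage": "region", "prompt": prompt, "response": response}


def image_sample(pair: ImageDescriptionPair, regions: List[RegionAttributePair],
                 width: int, height: int) -> Dict[str, str]:
    """Stage 2 (assumed format): ask for a detailed description of the whole image,
    listing the regions so both granularities appear in the prompt."""
    boxes = [list(normalize_bbox(r.bbox, width, height)) for r in regions]
    prompt = f"<image> Regions of interest: {boxes}. Describe the image in detail."
    return {"stage": "image", "prompt": prompt, "response": pair.description}


if __name__ == "__main__":
    region = RegionAttributePair("img_0001", (100, 200, 300, 400),
                                 {"name": "airplane", "color": "white"})
    caption = ImageDescriptionPair("img_0001", "A white airplane is parked near a terminal.")
    print(region_sample(region, 800, 800))
    print(image_sample(caption, [region], 800, 800))
```

In this sketch the region-level samples would be used first, so the model learns to tie a box to its attributes, and the image-level samples second, so it learns to produce the long description. The sketch passes regions as text, whereas the abstract refers to multi-granularity visual features, so it should be read only as a picture of the data flow.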

