Abstract:Large multimodal models (LMMs) have shown encouraging performance in the natural image domain using visual instruction tuning. However, these LMMs struggle to describe the content of remote sensing images for tasks such as image or region grounding, classification, etc. Recently, GeoChat make an effort to describe the contents of the RS images. Although, GeoChat achieves promising performance for various RS tasks, it struggles to describe the changes between bi-temporal RS images which is a key RS task. This necessitates the development of an LMM that can describe the changes between the bi-temporal RS images. However, there is insufficiency of datasets that can be utilized to tune LMMs. In order to achieve this, we introduce a change description instruction dataset that can be utilized to finetune an LMM and provide better change descriptions for RS images. Furthermore, we show that the LLaVA-1.5 model, with slight modifications, can be finetuned on the change description instruction dataset and achieve favorably better performance.

What problem does this paper attempt to address?

The problem that this paper attempts to solve is that existing large - scale multimodal models (LMMs) perform poorly in describing the changes between remote sensing (RS) images. Specifically: 1. **Limitations of existing models**: - Although existing LMMs perform well in the field of natural images, they have difficulties in describing the content of remote sensing images, especially in tasks such as image or region localization and classification. - Although the GeoChat model has achieved good performance in various remote sensing tasks, it performs poorly in describing the semantic changes between bi - temporal remote sensing images. 2. **Lack of appropriate datasets**: - The remote sensing field lacks sufficient multimodal dialogue datasets for instruction tuning, making it difficult for models to learn how to accurately describe image changes. - The change detection (CD) task requires paired images and text descriptions, and most existing remote sensing datasets lack such paired data. 3. **Research objectives**: - To overcome the above problems, the author proposes a new multimodal model CDChat, which is specifically used to describe the changes in remote sensing images. - The author creates a change - description - instruction dataset for fine - tuning LMM to improve its performance in the remote sensing - image - change - description task. ### Specific problem summary - **Describing changes**: Existing LMMs cannot well describe the changes between bi - temporal remote sensing images. - **Insufficient datasets**: There is a lack of multimodal dialogue datasets suitable for instruction tuning, especially datasets for change - detection tasks. - **Performance improvement**: A new method and dataset are required to improve the performance of LMM in the remote sensing - image - change - description task. By solving these problems, CDChat aims to provide more accurate and detailed descriptions of remote sensing - image changes, thereby promoting the development of multimodal models in the remote sensing field.

CDChat: A Large Multimodal Model for Remote Sensing Change Description

ChangeChat: An Interactive Model for Remote Sensing Change Analysis via Multimodal Instruction Tuning

GeoChat: Grounded Large Vision-Language Model for Remote Sensing

CCExpert: Advancing MLLM Capability in Remote Sensing Change Captioning with Difference-Aware Integration and a Foundational Dataset

LHRS-Bot-Nova: Improved Multimodal Large Language Model for Remote Sensing Vision-Language Interpretation

Towards a multimodal framework for remote sensing image change retrieval and captioning

Change-Agent: Towards Interactive Comprehensive Remote Sensing Change Interpretation and Analysis

ChangeMinds: Multi-task Framework for Detecting and Describing Changes in Remote Sensing

RS5M and GeoRSCLIP: A Large Scale Vision-Language Dataset and A Large Vision-Language Model for Remote Sensing

ChangeCLIP: Remote sensing change detection with multimodal vision-language representation learning

Semantic-CC: Boosting Remote Sensing Image Change Captioning via Foundational Knowledge and Semantic Guidance

LHRS-Bot: Empowering Remote Sensing with VGI-Enhanced Large Multimodal Language Model

Enhancing Perception of Key Changes in Remote Sensing Image Change Captioning

CaMML: Context-Aware Multimodal Learner for Large Models

Pixel-Level Change Detection Pseudo-Label Learning for Remote Sensing Change Captioning

Multi-Scale Feature Interaction Network for Remote Sensing Change Detection

MGIMM: Multi-Granularity Instruction Multimodal Model for Attribute-Guided Remote Sensing Image Detailed Description

RS-LLaVA: A Large Vision-Language Model for Joint Captioning and Question Answering in Remote Sensing Imagery

Intertemporal Interaction and Symmetric Difference Learning for Remote Sensing Image Change Captioning

A Novel Adaptive Fine-Tuning Algorithm for Multimodal Models: Self-Optimizing Classification and Selection of High-Quality Datasets in Remote Sensing

Remote Sensing Image Change Captioning Using Multi-Attentive Network with Diffusion Model