CDChat: A Large Multimodal Model for Remote Sensing Change Description

Mubashir Noman,Noor Ahsan,Muzammal Naseer,Hisham Cholakkal,Rao Muhammad Anwer,Salman Khan,Fahad Shahbaz Khan
2024-09-25
Abstract:Large multimodal models (LMMs) have shown encouraging performance in the natural image domain using visual instruction tuning. However, these LMMs struggle to describe the content of remote sensing images for tasks such as image or region grounding, classification, etc. Recently, GeoChat make an effort to describe the contents of the RS images. Although, GeoChat achieves promising performance for various RS tasks, it struggles to describe the changes between bi-temporal RS images which is a key RS task. This necessitates the development of an LMM that can describe the changes between the bi-temporal RS images. However, there is insufficiency of datasets that can be utilized to tune LMMs. In order to achieve this, we introduce a change description instruction dataset that can be utilized to finetune an LMM and provide better change descriptions for RS images. Furthermore, we show that the LLaVA-1.5 model, with slight modifications, can be finetuned on the change description instruction dataset and achieve favorably better performance.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The problem that this paper attempts to solve is that existing large - scale multimodal models (LMMs) perform poorly in describing the changes between remote sensing (RS) images. Specifically: 1. **Limitations of existing models**: - Although existing LMMs perform well in the field of natural images, they have difficulties in describing the content of remote sensing images, especially in tasks such as image or region localization and classification. - Although the GeoChat model has achieved good performance in various remote sensing tasks, it performs poorly in describing the semantic changes between bi - temporal remote sensing images. 2. **Lack of appropriate datasets**: - The remote sensing field lacks sufficient multimodal dialogue datasets for instruction tuning, making it difficult for models to learn how to accurately describe image changes. - The change detection (CD) task requires paired images and text descriptions, and most existing remote sensing datasets lack such paired data. 3. **Research objectives**: - To overcome the above problems, the author proposes a new multimodal model CDChat, which is specifically used to describe the changes in remote sensing images. - The author creates a change - description - instruction dataset for fine - tuning LMM to improve its performance in the remote sensing - image - change - description task. ### Specific problem summary - **Describing changes**: Existing LMMs cannot well describe the changes between bi - temporal remote sensing images. - **Insufficient datasets**: There is a lack of multimodal dialogue datasets suitable for instruction tuning, especially datasets for change - detection tasks. - **Performance improvement**: A new method and dataset are required to improve the performance of LMM in the remote sensing - image - change - description task. By solving these problems, CDChat aims to provide more accurate and detailed descriptions of remote sensing - image changes, thereby promoting the development of multimodal models in the remote sensing field.