Abstract: Remote sensing image change captioning (RSICC) has attracted considerable research interest because it automatically generates meaningful sentences describing the changes between remote sensing (RS) images. Existing RSICC methods mainly extract feature representations with networks pre-trained on natural image datasets, which degrades performance because aerial images have characteristics distinct from those of natural images. In addition, these methods struggle to capture the data distribution and to perceive contextual information between samples, which limits the robustness and generalization of their feature representations. Furthermore, because they directly aggregate all features, they focus insufficiently on the most change-aware discriminative information. To address these problems, this work proposes a novel framework, the Multi-Attentive network with Diffusion model for RSICC (MADiffCC). Specifically, we introduce a diffusion feature extractor, based on a diffusion model pre-trained on an RS image dataset, to capture multi-level and multi-time-step feature representations of bitemporal RS images. The diffusion model learns the training data distribution and the contextual information of RS objects, from which more robust and generalized representations can be extracted for the downstream task of change captioning. Furthermore, a difference encoder based on a time-channel-spatial attention (TCSA) mechanism is designed to extract discriminative information from the diffusion features. A gated multi-head cross-attention (GMCA)-guided change captioning decoder is then proposed to select and fuse crucial hierarchical features for more precise change description generation. Experimental results on the publicly available LEVIR-CC, LEVIRCCD, and DUBAI-CC datasets show that the proposed approach achieves state-of-the-art (SOTA) performance.
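
The abstract names a gated multi-head cross-attention (GMCA) decoder but does not give its formulation. As a rough illustration of the general idea of gated cross-attention fusion only, the following is a minimal single-head sketch in plain Python. Everything in it is an assumption made for illustration rather than the authors' design: the function names, the sigmoid gate computed from the concatenated query and attended features, the residual form of the fusion, and the random weights standing in for learned parameters.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def rand_matrix(rows, cols, rng):
    """Random weights standing in for learned parameters (illustrative only)."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def gated_cross_attention(queries, memory, w_q, w_k, w_v, w_g):
    """Hypothetical single-head gated cross-attention.

    queries: (n_q, d) decoder-side states (e.g. word embeddings).
    memory:  (n_m, d) encoder-side features (e.g. difference features).
    w_q, w_k, w_v: (d, d) projections; w_g: (2d, d) gate projection.
    Returns the fused states, the attention weights, and the gate values.
    """
    d = len(w_q[0])
    q = matmul(queries, w_q)
    k = matmul(memory, w_k)
    v = matmul(memory, w_v)

    # Scaled dot-product attention of each query over all memory positions.
    scores = matmul(q, transpose(k))
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    attended = matmul(weights, v)

    # Assumed gate: sigmoid of a linear map of [query ; attended].
    concat = [q_row + a_row for q_row, a_row in zip(queries, attended)]
    gate = [[sigmoid(x) for x in row] for row in matmul(concat, w_g)]

    # Assumed fusion: residual query plus gated attended features.
    fused = [
        [q_i + g_i * a_i for q_i, g_i, a_i in zip(q_row, g_row, a_row)]
        for q_row, g_row, a_row in zip(queries, gate, attended)
    ]
    return fused, weights, gate


if __name__ == "__main__":
    rng = random.Random(0)
    d = 4
    queries = rand_matrix(3, d, rng)   # 3 decoder positions
    memory = rand_matrix(5, d, rng)    # 5 visual feature positions
    w_q, w_k, w_v = (rand_matrix(d, d, rng) for _ in range(3))
    w_g = rand_matrix(2 * d, d, rng)
    fused, weights, gate = gated_cross_attention(queries, memory, w_q, w_k, w_v, w_g)
    print(len(fused), len(fused[0]))
```

In this sketch the gate takes values strictly between 0 and 1 per feature, so it can suppress attended features that are irrelevant to a given query while passing through the ones that matter. A multi-head variant would split the feature dimension into several heads and concatenate their outputs before gating.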

Remote Sensing Image Change Captioning Using Multi-Attentive Network with Diffusion Model