Abstract: Remote sensing image change captioning (RSICC) has attracted considerable research interest because it automatically generates meaningful sentences describing the changes between remote sensing (RS) images. Existing RSICC methods mainly extract feature representations with networks pre-trained on natural image datasets, which degrades performance because aerial images have characteristics distinct from those of natural images. In addition, these methods struggle to capture the data distribution and to perceive contextual information between samples, which limits the robustness and generalization of the feature representations. Furthermore, because they directly aggregate all features, they focus insufficiently on the most change-aware discriminative information. To address these problems, this work proposes a novel framework, the Multi-Attentive network with Diffusion model for RSICC (MADiffCC). Specifically, we introduce a diffusion feature extractor, based on a diffusion model pre-trained on an RS image dataset, to capture multi-level and multi-time-step feature representations of bitemporal RS images. The diffusion model learns the training data distribution and the contextual information of RS objects, from which more robust and generalized representations can be extracted for the downstream task of change captioning. A difference encoder based on a time-channel-spatial attention (TCSA) mechanism then uses the extracted diffusion features to obtain the discriminative information. Finally, a change captioning decoder guided by gated multi-head cross-attention (GMCA) selects and fuses crucial hierarchical features to generate more precise change descriptions. Experimental results on the publicly available LEVIR-CC, LEVIRCCD, and DUBAI-CC datasets verify that the proposed approach achieves state-of-the-art (SOTA) performance.
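
The abstract names a gated multi-head cross-attention (GMCA) but does not give its formulation. As a rough illustration only, the sketch below shows one common way such a block can be written: caption tokens attend to image difference features per head, and a sigmoid gate decides how much of the attended result replaces each token. This is not the authors' implementation. The function names, the gate parameterization `gate_w`, and the omission of learned query/key/value projections are all assumptions made for brevity.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(w, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def attention_head(queries, keys, values):
    """Scaled dot-product attention for a single head."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    out = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        weights = softmax(scores)
        out.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return out


def gated_cross_attention(text, image, num_heads, gate_w):
    """Hypothetical gated multi-head cross-attention.

    text:   query tokens, list of d-dimensional vectors
    image:  key/value tokens, list of d-dimensional vectors
    gate_w: d x 2d gate matrix (assumed parameterization)
    Learned Q/K/V projections are omitted (identity) to keep the sketch short.
    """
    d = len(text[0])
    assert d % num_heads == 0, "feature size must split evenly across heads"
    hd = d // num_heads
    heads = []
    for h in range(num_heads):
        sl = slice(h * hd, (h + 1) * hd)
        img_h = [x[sl] for x in image]
        heads.append(attention_head([t[sl] for t in text], img_h, img_h))
    # Concatenate the per-head outputs back into d-dimensional vectors.
    attended = [
        sum((heads[h][i] for h in range(num_heads)), [])
        for i in range(len(text))
    ]
    fused = []
    for t, a in zip(text, attended):
        # Gate computed from the token concatenated with its attended features.
        gate = [1.0 / (1.0 + math.exp(-z)) for z in matvec(gate_w, t + a)]
        fused.append([g * ai + (1.0 - g) * ti for g, ai, ti in zip(gate, a, t)])
    return fused
```

With an all-zero gate matrix every gate value is 0.5, so each output is the average of the token and its attended features. A trained gate would instead weight the two per feature dimension.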

HCNet: Hierarchical Feature Aggregation and Cross-Modal Feature Alignment for Remote Sensing Image Captioning

Cross-Modal Retrieval and Semantic Refinement for Remote Sensing Image Captioning

Improving Remote Sensing Image Captioning by Combining Grid Features and Transformer

Cooperative Connection Transformer for Remote Sensing Image Captioning

A Deep Semantic Alignment Network for the Cross-Modal Image-Text Retrieval in Remote Sensing

Enhanced Transformer for Remote-Sensing Image Captioning with Positional-Channel Semantic Fusion

Multi-Source Interactive Stair Attention for Remote Sensing Image Captioning

Semantic-CC: Boosting Remote Sensing Image Change Captioning via Foundational Knowledge and Semantic Guidance

Hybridizing Cross-Level Contextual and Attentive Representations for Remote Sensing Imagery Semantic Segmentation

TSFE: Two-Stage Feature Enhancement for Remote Sensing Image Captioning

NWPU-Captions Dataset and MLCA-Net for Remote Sensing Image Captioning

Multielement Feature-Based Hierarchical Context Integration Network for Remote Sensing Image Segmentation

Intertemporal Interaction and Symmetric Difference Learning for Remote Sensing Image Change Captioning

A Patch-Level Region-Aware Module with a Multi-Label Framework for Remote Sensing Image Captioning

A Hierarchical Consensus Attention Network for Feature Matching of Remote Sensing Images

Single-Stream Extractor Network With Contrastive Pre-Training for Remote-Sensing Change Captioning

Multi-Content Complementation Network for Salient Object Detection in Optical Remote Sensing Images

Remote Sensing Image Change Captioning Using Multi-Attentive Network with Diffusion Model

IC3: Image Captioning by Committee Consensus

SCAttNet: Semantic Segmentation Network with Spatial and Channel Attention Mechanism for High-Resolution Remote Sensing Images

Bidirectional Interactive Alignment Network for Image Captioning