Abstract: Remote sensing image change captioning (RSICC) has received considerable research interest because it can automatically generate meaningful sentences describing the changes in remote sensing (RS) images. Existing RSICC methods mainly use networks pre-trained on natural image datasets to extract feature representations. This degrades performance, since aerial images have distinctive characteristics compared to natural images. In addition, these methods struggle to capture the data distribution and to perceive contextual information between samples, which limits the robustness and generalization of the feature representations. Furthermore, because they directly aggregate all features, they focus insufficiently on the most change-aware discriminative information. To address these problems, this work proposes a novel framework, the Multi-Attentive network with Diffusion model for RSICC (MADiffCC). Specifically, we introduce a diffusion feature extractor, based on a diffusion model pre-trained on an RS image dataset, to capture multi-level and multi-time-step feature representations of bitemporal RS images. The diffusion model learns the training data distribution and the contextual information of RS objects, from which more robust and generalized representations can be extracted for the downstream task of change captioning. A difference encoder based on a time-channel-spatial attention (TCSA) mechanism is then designed to extract discriminative information from the diffusion features. Finally, a gated multi-head cross-attention (GMCA)-guided change captioning decoder is proposed to select and fuse crucial hierarchical features for more precise change description generation. Experimental results on the publicly available LEVIR-CC, LEVIRCCD, and DUBAI-CC datasets show that the proposed approach achieves state-of-the-art (SOTA) performance.
