Abstract: Remote sensing image change captioning (RSICC) aims to articulate the changes in objects of interest within bi-temporal remote sensing images using natural language. Given the limitations of current RSICC methods in expressing general features across multi-temporal and spatial scenarios, and their deficiency in providing granular, robust, and precise change descriptions, we introduce a novel change captioning (CC) method based on the foundational knowledge and semantic guidance, which we term Semantic-CC. Semantic-CC alleviates the dependency of high-generalization algorithms on extensive annotations by harnessing the latent knowledge of foundation models, and it generates more comprehensive and accurate change descriptions guided by pixel-level semantics from change detection (CD). Specifically, we propose a bi-temporal SAM-based encoder for dual-image feature extraction; a multi-task semantic aggregation neck for facilitating information interaction between heterogeneous tasks; a straightforward multi-scale change detection decoder to provide pixel-level semantic guidance; and a change caption decoder based on the large language model (LLM) to generate change description sentences. Moreover, to ensure the stability of the joint training of CD and CC, we propose a three-stage training strategy that supervises different tasks at various stages. We validate the proposed method on the LEVIR-CC and LEVIR-CD datasets. The experimental results corroborate the complementarity of CD and CC, demonstrating that Semantic-CC can generate more accurate change descriptions and achieve optimal performance across both tasks.

What problem does this paper attempt to address?

This paper addresses two limitations of current remote sensing image change captioning (RSICC) methods: they express general features poorly across multi-temporal and spatial scenes, and they fall short of fine-grained, robust, and accurate change descriptions. These problems mainly stem from an over-reliance on deep semantic information, which often ignores subtle details in the images and is sometimes affected by semantic noise irrelevant to change understanding, such as changes in illumination angle and intensity. To overcome these challenges, the paper proposes a new change captioning method, Semantic-CC, which combines foundation-model knowledge with semantic guidance. Semantic-CC reduces the dependence on large amounts of labeled data by leveraging the latent knowledge of the foundation model, and it generates more comprehensive and accurate change descriptions through pixel-level semantic guidance from change detection (CD).

The main contributions of the paper include:

1. **Introduction of Semantic-CC**: A change captioning method that combines foundation-model knowledge with semantic guidance, remains highly usable with minimal annotation, and generates more fine-grained and accurate sentence descriptions.
2. **Bi-temporal SAM-based encoder**: Built on the latent knowledge of the SAM foundation model, it integrates a bi-temporal change semantic filter (BCSF) to fuse bi-temporal information.
3. **Multi-task semantic aggregation neck**: Promotes information interaction between the two tasks through an intra-task attention mechanism and an inter-task attention mechanism.
4. **Change detection decoder**: Provides pixel-level semantic guidance and generates a change segmentation map.
5. **Change captioning decoder**: Generates change captions with a large language model (LLM) and includes a change semantic feature enhancer that generates bi-temporal difference features.
6. **Three-stage training strategy**: Ensures stable joint training of the change detection and change captioning tasks and prevents negative transfer in multi-task learning.

With these components, the paper reports that Semantic-CC generates more accurate change descriptions and achieves the best performance on both the change detection and change captioning tasks in its experiments on LEVIR-CD and LEVIR-CC.
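
The idea of supervising different tasks at different training stages can be pictured as a per-stage weighting of the two task losses. The sketch below is only an illustration: the summary does not say which task each stage supervises, so the stage names, their order (CD first, then CC, then both), and the weights are all assumptions, and `Stage`, `SCHEDULE`, `total_loss`, and `supervised_tasks` are hypothetical names rather than anything taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    """One training stage with a weight for each task loss."""
    name: str
    cd_weight: float  # weight on the change detection (CD) loss
    cc_weight: float  # weight on the change captioning (CC) loss


# ASSUMPTION: the ordering and weights below are illustrative only.
# The source states that there are three stages supervising different
# tasks, but not which task is supervised in which stage.
SCHEDULE = (
    Stage("stage_1_cd_only", cd_weight=1.0, cc_weight=0.0),
    Stage("stage_2_cc_only", cd_weight=0.0, cc_weight=1.0),
    Stage("stage_3_joint", cd_weight=1.0, cc_weight=1.0),
)


def total_loss(stage: Stage, cd_loss: float, cc_loss: float) -> float:
    """Combine the two task losses using the weights of the given stage."""
    return stage.cd_weight * cd_loss + stage.cc_weight * cc_loss


def supervised_tasks(stage: Stage) -> list[str]:
    """List the tasks that contribute a non-zero loss in this stage."""
    tasks = []
    if stage.cd_weight > 0:
        tasks.append("CD")
    if stage.cc_weight > 0:
        tasks.append("CC")
    return tasks


if __name__ == "__main__":
    # Example per-batch losses; the values are made up for the demo.
    for stage in SCHEDULE:
        print(stage.name, supervised_tasks(stage), total_loss(stage, 2.0, 5.0))
```

Setting a task's weight to zero in a stage removes its gradient signal, which is one simple way a schedule could keep an unstable task from disturbing the other early in training.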

Semantic-CC: Boosting Remote Sensing Image Change Captioning via Foundational Knowledge and Semantic Guidance

Learning Consensus-Aware Semantic Knowledge for Remote Sensing Image Captioning

Cross-Modal Retrieval and Semantic Refinement for Remote Sensing Image Captioning

Pixel-Level Change Detection Pseudo-Label Learning for Remote Sensing Change Captioning

A Decoupling Paradigm with Prompt Learning for Remote Sensing Image Change Captioning

Remote Sensing Image Change Captioning With Dual-Branch Transformers: A New Method and a Large Scale Dataset

CCExpert: Advancing MLLM Capability in Remote Sensing Change Captioning with Difference-Aware Integration and a Foundational Dataset

Detection Assisted Change Captioning for Remote Sensing Image

Remote Sensing Image Change Captioning Using Multi-Attentive Network with Diffusion Model

Enhancing Perception of Key Changes in Remote Sensing Image Change Captioning

MV-CC: Mask Enhanced Video Model for Remote Sensing Change Caption

Remote Sensing Image Captioning with Sequential Attention and Flexible Word Correlation

Diffusion-RSCC: Diffusion Probabilistic Model for Change Captioning in Remote Sensing Images

Changes to Captions: An Attentive Network for Remote Sensing Change Captioning

Human-like Controllable Image Captioning with Verb-specific Semantic Roles

Semantic-Aware Alignment Network for Cross-Resolution Change Detection

Change Captioning for Satellite Images Time Series

Single-Stream Extractor Network With Contrastive Pre-Training for Remote-Sensing Change Captioning

Inter-Temporal Interaction and Symmetric Difference Learning for Remote Sensing Image Change Captioning

Intertemporal Interaction and Symmetric Difference Learning for Remote Sensing Image Change Captioning

MfrNet: A New Multi-Scale Feature Refining Method for Remote Sensing Image Change Captioning