Semantic-CC: Boosting Remote Sensing Image Change Captioning via Foundational Knowledge and Semantic Guidance

Yongshuo Zhu, Lu Li, Keyan Chen, Chenyang Liu, Fugen Zhou, Zhenwei Shi
2024-07-19
Abstract: Remote sensing image change captioning (RSICC) aims to articulate the changes in objects of interest within bi-temporal remote sensing images using natural language. Given the limitations of current RSICC methods in expressing general features across multi-temporal and spatial scenarios, and their deficiency in providing granular, robust, and precise change descriptions, we introduce a novel change captioning (CC) method based on the foundational knowledge and semantic guidance, which we term Semantic-CC. Semantic-CC alleviates the dependency of high-generalization algorithms on extensive annotations by harnessing the latent knowledge of foundation models, and it generates more comprehensive and accurate change descriptions guided by pixel-level semantics from change detection (CD). Specifically, we propose a bi-temporal SAM-based encoder for dual-image feature extraction; a multi-task semantic aggregation neck for facilitating information interaction between heterogeneous tasks; a straightforward multi-scale change detection decoder to provide pixel-level semantic guidance; and a change caption decoder based on the large language model (LLM) to generate change description sentences. Moreover, to ensure the stability of the joint training of CD and CC, we propose a three-stage training strategy that supervises different tasks at various stages. We validate the proposed method on the LEVIR-CC and LEVIR-CD datasets. The experimental results corroborate the complementarity of CD and CC, demonstrating that Semantic-CC can generate more accurate change descriptions and achieve optimal performance across both tasks.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses the limitations of current remote sensing image change captioning (RSICC) methods: they express general features poorly across multi-temporal and spatial scenes, and they fall short of fine-grained, robust, and accurate change descriptions. These problems mainly stem from an over-reliance on deep semantic information, which tends to overlook subtle image details and is sometimes affected by semantic noise irrelevant to change understanding, such as changes in illumination angle and intensity.

To overcome these challenges, the paper proposes Semantic-CC, a change captioning method that combines foundation model knowledge with semantic guidance. Semantic-CC reduces the dependence on large amounts of labeled data by leveraging the latent knowledge of foundation models, and it generates more comprehensive and accurate change descriptions through pixel-level semantic guidance from change detection (CD).

The main contributions of the paper include:

1. **Introduction of Semantic-CC**: A change captioning method that combines foundation model knowledge and semantic guidance, which can achieve high usability with minimal labeling and generate more fine-grained and accurate sentence descriptions.
2. **Bi-temporal SAM-based encoder**: Built on the latent knowledge of the SAM foundation model, it integrates a bi-temporal change semantic filter (BCSF) to fuse bi-temporal information.
3. **Multi-task semantic aggregation neck**: Promotes information interaction between different tasks through an intra-task attention mechanism and an inter-task attention mechanism.
4. **Change detection decoder**: Provides pixel-level semantic guidance and generates a change segmentation map.
5. **Change captioning decoder**: Generates change captioning sentences based on a large language model (LLM), and includes a change semantic feature enhancer that generates bi-temporal difference features.
6. **Three-stage training strategy**: Ensures the stable joint training of the change detection and change captioning tasks and prevents negative transfer in multi-task learning.

Through these innovations, Semantic-CC can not only generate more accurate change descriptions, but also achieve optimal performance in both change detection and change captioning tasks.
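To make the idea of a three-stage training strategy more concrete, the following is a minimal sketch of how staged supervision of two tasks could be expressed as a loss-weight schedule. The summary does not say which task is supervised in which stage, so the stage order (CD only, then CC only, then joint), the weights, the epoch counts, and all names (`StageConfig`, `SCHEDULE`, `stage_for_epoch`, `combined_loss`) are assumptions for illustration and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StageConfig:
    name: str
    cd_weight: float  # weight on the change detection (CD) loss
    cc_weight: float  # weight on the change captioning (CC) loss
    epochs: int       # number of epochs spent in this stage


# Hypothetical schedule: the order, weights, and epoch counts are
# illustrative assumptions, not values reported by the paper.
SCHEDULE = (
    StageConfig("cd_only", cd_weight=1.0, cc_weight=0.0, epochs=2),
    StageConfig("cc_only", cd_weight=0.0, cc_weight=1.0, epochs=2),
    StageConfig("joint", cd_weight=1.0, cc_weight=1.0, epochs=1),
)


def stage_for_epoch(epoch, schedule=SCHEDULE):
    """Return the stage that is active at a 0-based epoch index."""
    start = 0
    for stage in schedule:
        if epoch < start + stage.epochs:
            return stage
        start += stage.epochs
    # Past the end of the schedule, stay in the final stage.
    return schedule[-1]


def combined_loss(cd_loss, cc_loss, stage):
    """Weight the two task losses according to the active stage."""
    return stage.cd_weight * cd_loss + stage.cc_weight * cc_loss


if __name__ == "__main__":
    for epoch in range(5):
        stage = stage_for_epoch(epoch)
        print(epoch, stage.name, combined_loss(0.5, 2.0, stage))
```

Setting a task's weight to zero in a stage removes its gradient contribution, which is one simple way to keep an unstable task from disturbing the other early in training. An actual implementation might instead freeze specific modules per stage.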