Abstract: Remote sensing image change captioning (RSICC) aims to automatically generate sentences that describe content differences between bitemporal remote sensing images. Recently, attention-based transformers have become a prevalent approach for capturing global change features. However, existing transformer-based RSICC methods face challenges such as large parameter counts and high computational complexity, both caused by the self-attention operation in the transformer encoder. To alleviate these issues, this paper proposes a Sparse Focus Transformer (SFT) for the RSICC task. Specifically, the SFT network consists of three main components: a high-level feature extractor based on a convolutional neural network (CNN), a transformer encoder built on a sparse focus attention mechanism and designed to locate and capture changing regions in dual-temporal images, and a description decoder that embeds images and words to generate sentences captioning the differences. The proposed SFT network reduces the number of parameters and the computational complexity by incorporating a sparse attention mechanism within the transformer encoder. Experimental results on various datasets demonstrate that, even with a reduction of over 90% in parameters and computational complexity for the transformer encoder, the proposed network still obtains competitive performance compared to other state-of-the-art RSICC methods. The code is available at https://github.com/sundongwei/SFT_chag2cap (Lite_Chag2cap).

What problem does this paper attempt to address?

The paper addresses the large parameter counts and high computational complexity of models for Remote Sensing Image Change Captioning (RSICC). When existing Transformer-based methods are applied to change captioning on bitemporal remote sensing images, the self-attention mechanism inflates both the number of model parameters and the computational cost. This raises training and inference costs and limits deployment in industrial applications, especially in environments with limited computing resources.

To alleviate these problems, the paper proposes a lightweight Sparse Focus Transformer (SFT) network. By introducing a sparse attention mechanism, the SFT network significantly reduces the number of parameters and the computational complexity while retaining the ability to capture and describe changed regions efficiently. The specific contributions are as follows:

1. **Adapting Sparse-Factorized Attention Matrices**: The sparse-factorized attention matrices used in natural language processing to generate long text sequences are applied to the remote sensing image change detection task, with the aim of building a sparse attention mechanism that locates the changed regions.
2. **Constructing the Sparse Focus Transformer**: A sparse focus Transformer is designed for the remote sensing image change captioning task. It significantly reduces redundancy in the multimodal model and improves modal representation and fusion.
3. **Experimental Verification**: Extensive experiments on multiple datasets show that, even when the number of parameters and the computational complexity are reduced by more than 90%, the proposed network remains competitive with current state-of-the-art RSICC methods, combining high accuracy with low complexity.

With these changes, the SFT network generates accurate descriptions while significantly reducing the model's parameter count and computational complexity, which makes it better suited to practical application scenarios.
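
To make the sparse attention idea more concrete, the following is a minimal sketch of a strided, factorized attention mask of the kind used in NLP sparse transformers. It is not the SFT implementation: the function names, the non-causal (bidirectional) variant, the 196-token sequence length, and the stride of 14 are all assumptions chosen for illustration. The sketch only shows how such a mask shrinks the number of query-key pairs that self-attention has to score; it says nothing about how the paper obtains its parameter reduction.

```python
def strided_sparse_mask(n, stride):
    """Return an n x n boolean mask; True means query i may attend to key j.

    Illustrative pattern only (an assumption, not the paper's exact design):
    each token attends to its local neighbourhood and to every stride-th token.
    """
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            local = abs(i - j) < stride        # tokens within the local window
            strided = (i - j) % stride == 0    # tokens at multiples of the stride
            mask[i][j] = local or strided
    return mask


def density(mask):
    """Fraction of the dense n x n attention pairs that the mask keeps."""
    n = len(mask)
    kept = sum(row.count(True) for row in mask)
    return kept / (n * n)


# Hypothetical setting: a 14 x 14 feature grid flattened into 196 tokens.
n, stride = 196, 14
mask = strided_sparse_mask(n, stride)
print(f"kept {density(mask):.1%} of the {n * n} dense attention pairs")
```

With a mask like this, each query scores only its local window plus a strided subset of keys instead of all 196 positions, which is the general way sparse attention lowers the cost of the encoder's attention step.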

A Lightweight Sparse Focus Transformer for Remote Sensing Image Change Captioning
