A Lightweight Sparse Focus Transformer for Remote Sensing Image Change Captioning

Dongwei Sun,Yajie Bao,Junmin Liu,Xiangyong Cao
DOI: https://doi.org/10.1109/JSTARS.2024.3471625
2024-10-11
Abstract: Remote sensing image change captioning (RSICC) aims to automatically generate sentences that describe content differences in remote sensing bitemporal images. Recently, attention-based transformers have become a prevalent idea for capturing the features of global change. However, existing transformer-based RSICC methods face challenges, e.g., high parameters and high computational complexity caused by the self-attention operation in the transformer encoder component. To alleviate these issues, this paper proposes a Sparse Focus Transformer (SFT) for the RSICC task. Specifically, the SFT network consists of three main components, i.e., a high-level features extractor based on a convolutional neural network (CNN), a sparse focus attention mechanism-based transformer encoder network designed to locate and capture changing regions in dual-temporal images, and a description decoder that embeds images and words to generate sentences for captioning differences. The proposed SFT network can reduce the parameter number and computational complexity by incorporating a sparse attention mechanism within the transformer encoder network. Experimental results on various datasets demonstrate that even with a reduction of over 90% in parameters and computational complexity for the transformer encoder, our proposed network can still obtain competitive performance compared to other state-of-the-art RSICC methods. The code is available at https://github.com/sundongwei/SFT_chag2cap (Lite_Chag2cap).
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses the large parameter count and high computational complexity of Remote Sensing Image Change Captioning (RSICC) models. When existing Transformer-based methods are applied to change captioning of bi-temporal remote sensing images, the self-attention operation in the Transformer encoder accounts for much of the parameter count and computational cost. This raises training and inference costs and limits deployment in industrial applications, especially in environments with limited computing resources.

To alleviate these problems, the paper proposes a lightweight Sparse Focus Transformer (SFT) network. By introducing a sparse attention mechanism into the Transformer encoder, the SFT network significantly reduces the number of parameters and the computational complexity while retaining the ability to locate and describe the changed regions. The specific contributions are as follows:

1. **Adapting Sparse Factorized Attention Matrices**: Sparse factorizations of the attention matrix, used in natural language processing to generate long text sequences, are applied to the remote sensing image change detection task to build a sparse attention mechanism that locates the changed regions.
2. **Constructing the Sparse Focus Transformer**: A sparse focus Transformer is designed for the remote sensing image change captioning task. It significantly reduces redundancy in the multimodal model and improves modal representation and fusion.
3. **Experimental Verification**: Experiments on multiple datasets show that even when the parameters and computational complexity of the Transformer encoder are reduced by more than 90%, the proposed network remains competitive with current state-of-the-art RSICC methods.
With these changes, the SFT network generates accurate change descriptions with far fewer parameters and much lower computational cost, which makes it better suited to practical application scenarios.
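
The idea behind a sparse factorized attention matrix can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: it assumes a non-causal variant of the strided pattern used for long-sequence generation, in which each token attends to a local window plus every `stride`-th position, and the names `strided_sparse_mask`, `masked_attention`, and the chosen `stride` are hypothetical. With `stride` close to the square root of the sequence length `n`, each query attends to a number of keys on the order of `sqrt(n)` instead of `n`.

```python
import numpy as np


def strided_sparse_mask(n: int, stride: int) -> np.ndarray:
    """Return a boolean (n, n) mask; True where query i may attend to key j.

    Two patterns are combined (non-causal variant, an assumption here):
      - local:   j lies within `stride` positions of i
      - strided: j and i are a multiple of `stride` apart
    """
    idx = np.arange(n)
    diff = idx[:, None] - idx[None, :]
    local = np.abs(diff) < stride
    strided = (diff % stride) == 0
    return local | strided


def masked_attention(q, k, v, mask):
    """Scaled dot-product attention restricted to the pairs allowed by `mask`."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(mask, scores, -np.inf)       # disallowed pairs get zero weight
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v


if __name__ == "__main__":
    n, d, stride = 256, 32, 16                     # e.g. a 16 x 16 feature map flattened to 256 tokens
    rng = np.random.default_rng(0)
    q, k, v = (rng.standard_normal((n, d)) for _ in range(3))

    mask = strided_sparse_mask(n, stride)
    out = masked_attention(q, k, v, mask)
    print("output shape:", out.shape)
    print("fraction of query-key pairs kept:", mask.mean())
```

This sketch applies the mask to a dense score matrix, so it shows only which query-key pairs are kept. An efficient implementation would compute just the kept pairs, which is where the savings in computation come from.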