Abstract: Change detection (CD) in remote sensing (RS) images is a critical task in which deep learning has achieved significant success. Current networks often employ pixel-based differencing, ratio-based, classification-based, or feature concatenation methods to represent changes of interest. However, these methods often fail to detect the desired changes because they are highly sensitive to atmospheric conditions, lighting variations, and phenological variations, which leads to detection errors. Inspired by the transformer structure, we adopt a cross-attention mechanism to extract feature differences between bitemporal images more robustly. The method rests on the assumption that, if there is no change between an image pair, the semantic features of one temporal image can be well represented by the semantic features of the other temporal image. Conversely, if there is a change, the reconstruction errors are significant. Therefore, a Cross Swin transformer-based Siamese U-shaped network, namely CSTSUNet, is proposed for RS CD. CSTSUNet consists of an encoder, a difference feature extraction component, and a decoder. The encoder is based on a hierarchical residual network (ResNet) with a Siamese U-Net structure, allowing parallel processing of the bitemporal images and extraction of multiscale features. The difference feature extraction component consists of four difference feature extraction modules that compute difference features at multiple scales. Each module employs a Cross Swin transformer to exchange information between the bitemporal images. The decoder takes the multiscale difference features as input and injects details and boundaries iteratively, level by level, progressively refining the change map. 
We conduct experiments on three public datasets, and the experimental results demonstrate that the proposed CSTSUNet outperforms other state-of-the-art methods in terms of both qualitative and quantitative analyses. Our code is available at https://github.com/l7170/CSTSUNet.git.
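
The reconstruction assumption stated in the abstract can be illustrated with a toy example. The sketch below is not the authors' implementation (see the repository linked above for that): it uses plain scaled dot-product cross-attention with no learned projections, no Swin windowing, and no multi-head structure, and the function names `cross_attention_reconstruct` and `reconstruction_error` are hypothetical. It only shows that tokens of one image are reconstructed almost exactly when the other image contains matching tokens, and poorly when it does not.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention_reconstruct(queries, keys_values):
    """Rebuild each query token as an attention-weighted mix of the other image's tokens.

    Assumption: keys and values are the raw tokens themselves (no learned projections).
    """
    dim = len(queries[0])
    scale = 1.0 / math.sqrt(dim)
    reconstructed = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys_values]
        weights = softmax(scores)
        reconstructed.append(
            [sum(w * k[d] for w, k in zip(weights, keys_values)) for d in range(dim)]
        )
    return reconstructed


def reconstruction_error(tokens_t1, tokens_t2):
    """Per-token Euclidean distance between time-1 tokens and their reconstruction from time 2."""
    recon = cross_attention_reconstruct(tokens_t1, tokens_t2)
    return [
        math.sqrt(sum((a - b) ** 2 for a, b in zip(token, rebuilt)))
        for token, rebuilt in zip(tokens_t1, recon)
    ]


# Toy feature tokens (2 tokens, 2 channels); the values are made up for illustration.
tokens_t1 = [[4.0, 0.0], [0.0, 4.0]]
tokens_t2_unchanged = [[4.0, 0.0], [0.0, 4.0]]
tokens_t2_changed = [[4.0, 0.0], [-4.0, 0.0]]  # second token no longer has a counterpart

print(reconstruction_error(tokens_t1, tokens_t2_unchanged))
print(reconstruction_error(tokens_t1, tokens_t2_changed))
```

In the unchanged case both errors are close to zero. In the changed case the first token is still reconstructed well, while the second token has a large error, which is the kind of signal a difference feature extraction module could use. In a trained network the queries, keys, and values would come from learned projections of encoder features at each scale rather than from raw tokens.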

Single-Stream Extractor Network With Contrastive Pre-Training for Remote-Sensing Change Captioning

Remote Sensing Image Change Captioning Using Multi-Attentive Network with Diffusion Model

Progressive Scale-aware Network for Remote sensing Image Change Captioning

Intertemporal Interaction and Symmetric Difference Learning for Remote Sensing Image Change Captioning

Remote Sensing Image Change Captioning With Dual-Branch Transformers: A New Method and a Large Scale Dataset

Semantic-CC: Boosting Remote Sensing Image Change Captioning via Foundational Knowledge and Semantic Guidance

Changes to Captions: An Attentive Network for Remote Sensing Change Captioning

Towards a multimodal framework for remote sensing image change retrieval and captioning

TSFE: Two-Stage Feature Enhancement for Remote Sensing Image Captioning

A Lightweight Sparse Focus Transformer for Remote Sensing Image Change Captioning

Cross-Modal Retrieval and Semantic Refinement for Remote Sensing Image Captioning

Diffusion-RSCC: Diffusion Probabilistic Model for Change Captioning in Remote Sensing Images

Pixel-Level Change Detection Pseudo-Label Learning for Remote Sensing Change Captioning

CCExpert: Advancing MLLM Capability in Remote Sensing Change Captioning with Difference-Aware Integration and a Foundational Dataset

Multi-Source Interactive Stair Attention for Remote Sensing Image Captioning

MV-CC: Mask Enhanced Video Model for Remote Sensing Change Caption

Enhancing Perception of Key Changes in Remote Sensing Image Change Captioning

Relation-aware Multi-pass Comparison Deconfounded Network for Change Captioning

CSTSUNet: A Cross Swin Transformer-Based Siamese U-Shape Network for Change Detection in Remote Sensing Images

Semantic-Explicit Filtering Network for Remote Sensing Image Change Detection

Enhanced Transformer for Remote-Sensing Image Captioning with Positional-Channel Semantic Fusion