Abstract: Remote sensing change detection aims to compare two or more images recorded for the same area but taken at different time stamps to quantitatively and qualitatively assess changes in geographical entities and environmental factors. Mainstream models are usually built on pixel-by-pixel change detection paradigms, which cannot tolerate the diversity of changes due to complex scenes and variation in imaging conditions. To address this shortcoming, this paper rethinks change detection with the mask view, and further proposes the corresponding: 1) meta-architecture CDMask and 2) instance network CDMaskFormer. The components of CDMask include a Siamese backbone, change extractor, pixel decoder, transformer decoder and normalized detector, which together ensure the proper functioning of the mask detection paradigm. Since the change query can be adaptively updated based on the bi-temporal feature content, the proposed CDMask can adapt to different latent data distributions, thus accurately identifying changes of interest in complex scenarios. Consequently, we further propose the instance network CDMaskFormer customized for the change detection task, which includes: (i) a change extractor instantiated with spatial-temporal convolutional attention to capture spatio-temporal context simultaneously with lightweight operations; and (ii) a transformer decoder instantiated with scene-guided axial attention to extract more spatial details. State-of-the-art performance of CDMaskFormer is achieved on five benchmark datasets with a satisfactory efficiency-accuracy trade-off. Code is available at https://github.com/xwmaxwma/rschange.

What problem does this paper attempt to address?

This paper addresses the deficiencies of existing remote sensing change detection methods in dealing with complex scenes and varying imaging conditions. Specifically, the traditional pixel-level change detection paradigm (CDPixel) relies on fixed semantic prototypes to detect changes of interest, which leaves it unable to tolerate the different data distributions produced by diverse scenes and imaging conditions (such as frequent irrelevant changes caused by weather, lighting, seasons, and human activities). The paper therefore rethinks the change detection task and proposes a mask-view-based method that adapts more flexibly to different data distributions, thereby accurately identifying changes of interest in complex scenes.

### Main contributions of the paper

1. **Analysis of the deficiencies of the existing pixel-level change detection paradigm**:
   - Proposes a method for generating adaptive change masks through learnable change queries. To the best of the authors' knowledge, this is the first change detection method based on mask classification, providing a new paradigm for the design of subsequent work.
2. **Propose the meta-architecture CDMask**:
   - CDMask needs only a small number of modifications to be compatible with various state-of-the-art DETR frameworks. In particular, a Normalized Detector is designed, which is a key component for CDMask to work properly.
3. **Propose the instance network CDMaskFormer**:
   - The components of CDMaskFormer are highly customizable, including the novel spatio-temporal convolutional attention mechanism and the scene-guided axial attention mechanism, which are used to instantiate the change extractor and the transformer decoder. These designs enable CDMaskFormer to outperform previous state-of-the-art models on five RSCD benchmark datasets while remaining computationally efficient.

### Method overview

#### 3.1 CDMask

- **Siamese Backbone**:
  - Use a weight-shared Siamese backbone network to extract bi-temporal features, and adopt a hierarchical backbone network to better model multi-scale geospatial objects.
- **Change Extractor**:
  - Fuse bi-temporal features to generate high-quality change representations to update change queries. Multiple fusion methods can be used, such as concatenation, element-level subtraction, dense connection, cross-attention, and state-space modeling.
- **Pixel Decoder**:
  - Extract multi-scale features to allow accurate detection of change regions. Common pixel decoder designs such as deformable attention and feature pyramids can be directly compatible with CDMask.
- **Transformer Decoder**:
  - Introduce learnable change queries to interact with change representations to adaptively generate change masks. Directly compatible with the existing transformer decoder designs of DETRs, but with a smaller number of change queries.
- **Normalized Detector**:
  - Redesign the detector so that it can correctly determine the category of each target pixel, thereby adapting to the RSCD task. Map the output values of the change channels between 0 and 1 through min-max normalization, and detect changes based on a fixed threshold.

#### 3.2 CDMaskFormer

- **Spatial-temporal Convolutional Attention**:
  - Propose a lightweight change extractor that simultaneously captures the context in the spatio-temporal range through inexpensive convolutional attention, selectively enhancing the changes in the regions of interest and suppressing the interference of irrelevant changes.
- **Scene-guided Axial Attention**:
  - Through the scene-guided axial attention mechanism, mine more detailed information from high-resolution change representations. This mechanism is especially suitable for capturing the features of strip-like objects (such as roads, rivers, and buildings) and reducing computational consumption.

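The Normalized Detector step described above (min-max normalization of the change channel followed by a fixed threshold) can be sketched in a few lines. This is a minimal illustration in plain Python, not the authors' implementation: the function name `normalized_detect`, the default threshold of 0.5, the `eps` guard, and the choice to return an all-zero mask for a constant score map are all assumptions made for the example.

```python
def normalized_detect(change_scores, threshold=0.5, eps=1e-8):
    """Min-max normalize a 2D change-score map to [0, 1], then threshold it.

    change_scores: list of rows of floats (hypothetical change-channel output).
    threshold:     fixed cut-off applied after normalization (assumed 0.5).
    Returns a binary mask with the same shape (1 = change, 0 = no change).
    """
    flat = [v for row in change_scores for v in row]
    lo, hi = min(flat), max(flat)
    scale = hi - lo

    # Assumption: a constant map carries no contrast, so report "no change".
    if scale < eps:
        return [[0 for _ in row] for row in change_scores]

    # Rescale every score into [0, 1] and compare against the fixed threshold.
    return [
        [1 if (v - lo) / scale > threshold else 0 for v in row]
        for row in change_scores
    ]


scores = [[-2.0, 0.0],
          [4.0, 6.0]]
print(normalized_detect(scores))  # [[0, 0], [1, 1]]
```

Because the scores are rescaled per image before thresholding, the same fixed threshold applies regardless of the raw output range of the change channel.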
### Experimental results

- **Performance comparison**:
  - On multiple benchmark datasets (such as DSIFN-CD, CLCD, SYSU-CD, LEVIR-CD, and WHU-CD), CDMaskFormer achieves the best performance, notably on the F1-score, precision, recall, IoU, and OA metrics.

Through these innovations, CDMask and CD
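For reference, the metrics named in the performance comparison are conventionally derived from the pixel-level confusion counts of a predicted binary change mask against the ground truth. The sketch below uses the standard definitions; the helper names `confusion_counts` and `change_metrics` are hypothetical, and returning 0.0 for an undefined ratio is an assumption of this example, not something the summary specifies.

```python
def confusion_counts(pred, target):
    """Count TP, FP, FN, TN between two binary masks given as lists of rows."""
    tp = fp = fn = tn = 0
    for p_row, t_row in zip(pred, target):
        for p, t in zip(p_row, t_row):
            if p and t:
                tp += 1
            elif p:
                fp += 1
            elif t:
                fn += 1
            else:
                tn += 1
    return tp, fp, fn, tn


def change_metrics(tp, fp, fn, tn):
    """Standard precision, recall, F1, IoU and overall accuracy (OA)."""
    # Assumption: undefined ratios (zero denominator) are reported as 0.0.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    iou = tp / (tp + fp + fn) if tp + fp + fn else 0.0
    total = tp + fp + fn + tn
    oa = (tp + tn) / total if total else 0.0
    return {"precision": precision, "recall": recall,
            "f1": f1, "iou": iou, "oa": oa}


pred = [[1, 1, 0],
        [0, 0, 0]]
target = [[1, 0, 0],
          [1, 0, 0]]
metrics = change_metrics(*confusion_counts(pred, target))
print(metrics["precision"], metrics["recall"])  # 0.5 0.5
```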

Rethinking Remote Sensing Change Detection With A Mask View

MaskCD: A Remote Sensing Change Detection Network Based on Mask Classification

ChangeMask: Deep Multi-Task Encoder-Transformer-decoder Architecture for Semantic Change Detection

Mask-CDNet:a mask based pixel change detection network

Mask-Guided Local–Global Attentive Network for Change Detection in Remote Sensing Images

Fine-Grained High-Resolution Remote Sensing Image Change Detection by SAM-UNet Change Detection Model

Mask Approximation Net: Merging Feature Extraction and Distribution Learning for Remote Sensing Change Captioning

Change-Aware Cascaded Dual-Decoder Network for Remote Sensing Image Change Detection

Robust change detection for remote sensing images based on temporospatial interactive attention module

3-D Neighborhood Cross-Differencing: A New Paradigm Serves Remote Sensing Change Detection

EfficientCD: A New Strategy For Change Detection Based With Bi-temporal Layers Exchanged

Unsupervised Multimodal Change Detection by Distilling Common and Discrepant Representations

Global-Local Collaborative Learning Network for Optical Remote Sensing Image Change Detection

SMD-Net: Siamese Multi-Scale Difference-Enhancement Network for Change Detection in Remote Sensing

Change Detection for High-Resolution Remote Sensing Images Based on a Multi-Scale Attention Siamese Network

MCCRNet: A Multi-Level Change Contextual Refinement Network for Remote Sensing Image Change Detection

Remote Sensing Semantic Change Detection Model for Improving Objects Completeness

SRC-Net: Bi-Temporal Spatial Relationship Concerned Network for Change Detection

Domain Adaptive and Interactive Differential Attention Network for Remote Sensing Image Change Detection