Abstract: In recent years, semantic segmentation of remote sensing images has become increasingly prevalent across a diverse range of domains, including forest detection, water body detection, urban rail transportation planning, and building extraction. The incorporation of the Transformer model into computer vision has significantly improved the efficacy and accuracy of these algorithms. Nevertheless, the Transformer model's high computational complexity and its dependence on weights pre-trained on large datasets lead to slow convergence when training for remote sensing segmentation tasks. Motivated by the success of the adapter module in natural language processing, this paper presents a novel adapter module (ResAttn) that improves training speed for remote sensing segmentation. ResAttn adopts a dual-attention structure to capture the interdependencies between sets of features, thereby improving its global modeling capability, and introduces a Swin Transformer-like down-sampling method that reduces the resolution while limiting information loss and retaining the original architecture. In addition, existing Transformer models are limited in their ability to capture local high-frequency information, which can lead to inadequate extraction of edge and texture features. To address this limitation, this paper proposes a Local Feature Extractor (LFE) module, which is based on a convolutional neural network (CNN) and incorporates multi-scale feature extraction and a residual structure. Further, a mask-based segmentation method is employed, and a residual-enhanced deformable attention block (Deformer Block) is incorporated to improve small-target segmentation accuracy. Finally, extensive experiments were performed on the ISPRS Potsdam datasets. The experimental results demonstrate the superior performance of the model described in this paper.
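
The abstract does not give the details of the Swin Transformer-like down-sampling used in ResAttn, so the following is only a minimal sketch of the generic patch-merging idea from Swin Transformer, not the paper's implementation. It shows how a feature map can be halved in resolution without discarding any values: the four pixels of each 2 x 2 block are concatenated along the channel axis. The function name `patch_merge` and the nested-list layout (height x width x channels) are assumptions made for illustration. In Swin Transformer the concatenated 4C channels are then normalized and linearly projected to 2C, which is omitted here.

```python
def patch_merge(feature_map):
    """Rearrange an H x W x C map into (H/2) x (W/2) x 4C.

    Hypothetical helper for illustration: feature_map is a nested list
    indexed as feature_map[row][col] -> list of C channel values.
    """
    height = len(feature_map)
    width = len(feature_map[0])
    if height % 2 or width % 2:
        raise ValueError("height and width must be even")

    merged = []
    for i in range(0, height, 2):
        row = []
        for j in range(0, width, 2):
            # Concatenate the channels of the four pixels in this 2x2 block,
            # so resolution drops by half while every value is kept.
            row.append(
                feature_map[i][j]
                + feature_map[i + 1][j]
                + feature_map[i][j + 1]
                + feature_map[i + 1][j + 1]
            )
        merged.append(row)
    return merged


# 4 x 4 map with a single channel; each pixel stores its own flat index.
fmap = [[[4 * r + c] for c in range(4)] for r in range(4)]
merged = patch_merge(fmap)
```

Unlike strided pooling, which keeps one value per block, this rearrangement moves spatial detail into the channel dimension, which is one way to read the abstract's claim of reducing resolution with less information loss.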

DSAT-Net: Dual Spatial Attention Transformer for Building Extraction From Aerial Images

Asymmetric Network Combining CNN and Transformer for Building Extraction from Remote Sensing Images

SDSC-UNet: Dual Skip Connection ViT-Based U-Shaped Model for Building Extraction

Building Extraction With Vision Transformer

Dual-Stream Feature Extraction Network Based on CNN and Transformer for Building Extraction

A Dual-Branch Fusion Network Based on Reconstructed Transformer for Building Extraction in Remote Sensing Imagery

Extracting Building Footprint From Remote Sensing Images by an Enhanced Vision Transformer Network

D2T-Net: A dual-domain transformer network exploiting spatial and channel dimensions for semantic segmentation of urban mobile laser scanning point clouds

STransU2Net: Transformer based hybrid model for building segmentation in detailed satellite imagery

Cross-level and multiscale CNN-Transformer network for automatic building extraction from remote sensing imagery

Building Extraction from Very High Resolution Aerial Imagery Using Joint Attention Deep Neural Network

Multi-Scale Attention Network for Building Extraction from High-Resolution Remote Sensing Images

SDSNet: Building Extraction in High-Resolution Remote Sensing Images Using a Deep Convolutional Network with Cross-Layer Feature Information Interaction Filtering

IFTSDNet: An Interact-Feature Transformer Network With Spatial Detail Enhancement Module for Change Detection

LSRFormer: Efficient Transformer Supply Convolutional Neural Networks With Global Information for Aerial Image Segmentation

Dual Aggregation Transformer for Image Super-Resolution

ACTNet: A Dual-Attention Adapter with a CNN-Transformer Network for the Semantic Segmentation of Remote Sensing Imagery

A shape-aware enhancement Vision Transformer for building extraction from remote sensing imagery

Enhanced Window-Based Self-Attention with Global and Multi-Scale Representations for Remote Sensing Image Super-Resolution

Transformer based on channel-spatial attention for accurate classification of scenes in remote sensing image

ACMFNet: Attention-Based Cross-Modal Fusion Network for Building Extraction of Remote Sensing Images