Abstract: In recent years, semantic segmentation of remote sensing images has been applied across a growing range of domains, including forest detection, water body detection, urban rail transportation planning, and building extraction. The incorporation of the Transformer model into computer vision has significantly improved the efficacy and accuracy of these algorithms. Nevertheless, the Transformer model's high computational complexity and its dependence on weights pre-trained on large datasets lead to slow convergence when training for remote sensing segmentation tasks. Motivated by the success of the adapter module in natural language processing, this paper presents a novel adapter module (ResAttn) that improves training speed for remote sensing segmentation. ResAttn adopts a dual-attention structure to capture the interdependencies between sets of features, thereby improving its global modeling capability, and introduces a Swin Transformer-like down-sampling method that reduces resolution while limiting information loss and retaining the original architecture. In addition, existing Transformer models are limited in their ability to capture local high-frequency information, which can lead to inadequate extraction of edge and texture features. To address this, the paper proposes a Local Feature Extractor (LFE) module, based on a convolutional neural network (CNN), that incorporates multi-scale feature extraction and a residual structure. Further, a mask-based segmentation method is employed and a residual-enhanced deformable attention block (Deformer Block) is incorporated to improve small-target segmentation accuracy. Finally, extensive experiments were performed on the ISPRS Potsdam dataset, and the results demonstrate the superior performance of the proposed model.
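
The abstract does not specify how the Swin Transformer-like down-sampling is implemented. As an illustration only, the sketch below shows the rearrangement step of patch merging as used in the Swin Transformer: each 2 x 2 block of pixels is concatenated along the channel axis, so the resolution halves while every input value is retained. The function name `patch_merge` and the nested-list layout are assumptions made for this example, not part of the paper; Swin follows this step with normalization and a linear projection from 4C to 2C channels, which is omitted here.

```python
def patch_merge(feature_map):
    """Rearrange an H x W x C feature map into (H/2) x (W/2) x 4C.

    feature_map is a nested list indexed as [row][col][channel].
    Hypothetical helper for illustration; H and W must be even.
    """
    height, width = len(feature_map), len(feature_map[0])
    if height % 2 or width % 2:
        raise ValueError("height and width must be even")
    merged = []
    for r in range(0, height, 2):
        row = []
        for c in range(0, width, 2):
            # Concatenate the four pixels of each 2x2 block along the
            # channel axis, so no input value is discarded.
            row.append(
                feature_map[r][c]
                + feature_map[r + 1][c]
                + feature_map[r][c + 1]
                + feature_map[r + 1][c + 1]
            )
        merged.append(row)
    return merged


# A 4 x 4 map with one channel per pixel, valued 0..15 in row-major order.
fmap = [[[4 * r + c] for c in range(4)] for r in range(4)]
out = patch_merge(fmap)
print(len(out), len(out[0]), len(out[0][0]))  # 2 2 4
print(out[0][0])                              # [0, 4, 1, 5]
```

Unlike strided pooling, this rearrangement keeps all 16 input values in the 2 x 2 x 4 output, which is one plausible reading of how such a down-sampling step can lower resolution while limiting information loss.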

A dual-branch hybrid network of CNN and transformer with adaptive keyframe scheduling for video semantic segmentation

EHANet: Efficient Hybrid Attention Network Towards Real-time Semantic Segmentation

SSNet: A Novel Transformer and CNN Hybrid Network for Remote Sensing Semantic Segmentation

Transformer and CNN Hybrid Deep Neural Network for Semantic Segmentation of Very-High-Resolution Remote Sensing Imagery

Hybrid Dilated Convolution Network Using Attentive Kernels for Real-Time Semantic Segmentation

PSSD-Transformer: Powerful Sparse Spike-Driven Transformer for Image Semantic Segmentation

Dual-Augmented Transformer Network for Weakly Supervised Semantic Segmentation

LACTNet: A Lightweight Real-Time Semantic Segmentation Network Based on an Aggregated Convolutional Neural Network and Transformer

Deep Dual-resolution Networks for Real-time and Accurate Semantic Segmentation of Road Scenes

DARSegNet: A Real-Time Semantic Segmentation Method Based on Dual Attention Fusion Module and Encoder-Decoder Network

Asymmetric-Convolution-Guided Multipath Fusion for Real-Time Semantic Segmentation Networks

Bilateral Network with Residual U-blocks and Dual-Guided Attention for Real-time Semantic Segmentation

TCNet: Multiscale Fusion of Transformer and CNN for Semantic Segmentation of Remote Sensing Images

ACTNet: A Dual-Attention Adapter with a CNN-Transformer Network for the Semantic Segmentation of Remote Sensing Imagery

A Fast Attention-Guided Hierarchical Decoding Network for Real-Time Semantic Segmentation

Dual-Path Feature Fusion Network for Semantic Segmentation of Remote Sensing Images

HAFormer: Unleashing the Power of Hierarchy-Aware Features for Lightweight Semantic Segmentation

Adaptive multi-scale dual attention network for semantic segmentation

Weakly Supervised Video Object Segmentation via Dual-attention Cross-branch Fusion

Efficient Dual-Branch Bottleneck Networks of Semantic Segmentation Based on CCD Camera

Dual Correlation Network for Efficient Video Semantic Segmentation