Abstract: In recent years, semantic segmentation of remote sensing images has become increasingly prevalent across a diverse range of domains, including forest detection, water body detection, urban rail transportation planning, and building extraction. With the incorporation of the Transformer model into computer vision, the efficacy and accuracy of these algorithms have been significantly enhanced. Nevertheless, the Transformer model's high computational complexity and its dependence on weights pre-trained on large datasets lead to slow convergence during training for remote sensing segmentation tasks. Motivated by the success of the adapter module in natural language processing, this paper presents a novel adapter module (ResAttn) that improves training speed for remote sensing segmentation. ResAttn adopts a dual-attention structure to capture the interdependencies between sets of features, thereby improving its global modeling capabilities, and introduces a Swin Transformer-like down-sampling method that reduces the resolution while limiting information loss and retaining the original architecture. In addition, the existing Transformer model is limited in its ability to capture local high-frequency information, which can lead to inadequate extraction of edge and texture features. To address this limitation, this paper proposes a Local Feature Extractor (LFE) module, which is based on a convolutional neural network (CNN) and incorporates multi-scale feature extraction and a residual structure. Further, a mask-based segmentation method is employed and a residual-enhanced deformable attention block (Deformer Block) is incorporated to improve small-target segmentation accuracy. Finally, extensive experiments were performed on the ISPRS Potsdam datasets. The experimental results demonstrate the superior performance of the model described in this paper.
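
The abstract mentions a Swin Transformer-like down-sampling step that lowers resolution while limiting information loss. The sketch below illustrates the general idea behind Swin-style patch merging and is not the paper's implementation: each 2x2 neighbourhood is stacked along the channel axis so that no pixel is discarded, then normalised and projected to a narrower width. The function name `patch_merging`, the NumPy formulation, the random projection matrix, and the 8x8x16 input size are all assumptions made for illustration; in a trained model the projection would be learned.

```python
import numpy as np


def patch_merging(x, weight):
    """Halve the spatial resolution of a (H, W, C) feature map, Swin-style.

    The four pixels of every 2x2 neighbourhood are concatenated along the
    channel axis (giving 4C channels), normalised, and projected linearly.
    """
    h, w, _ = x.shape
    if h % 2 or w % 2:
        raise ValueError("H and W must both be even")

    # Gather the four interleaved sub-grids; each has shape (H/2, W/2, C).
    merged = np.concatenate(
        [x[0::2, 0::2], x[1::2, 0::2], x[0::2, 1::2], x[1::2, 1::2]],
        axis=-1,
    )  # (H/2, W/2, 4C)

    # Layer normalisation over channels (no learned scale or shift here).
    mean = merged.mean(axis=-1, keepdims=True)
    std = merged.std(axis=-1, keepdims=True)
    normed = (merged - mean) / (std + 1e-5)

    # Linear projection from 4C channels to the output width.
    return normed @ weight


rng = np.random.default_rng(0)
features = rng.standard_normal((8, 8, 16))                # hypothetical input
projection = rng.standard_normal((4 * 16, 2 * 16)) * 0.02  # stand-in for learned weights
out = patch_merging(features, projection)
print(out.shape)  # (4, 4, 32)
```

Unlike strided pooling, the concatenation keeps every input value available to the projection, which is the sense in which this kind of down-sampling reduces information loss.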

TextFormer: Component-aware Text Segmentation with Transformer

SemiCVT: Semi-Supervised Convolutional Vision Transformer for Semantic Segmentation

MeshFormer: High-resolution Mesh Segmentation with Graph Transformer

SRFormer: Text Detection Transformer with Incorporated Segmentation and Regression

EAFormer: Scene Text Segmentation with Edge-Aware Transformers

Graph-Segmenter: Graph Transformer with Boundary-aware Attention for Semantic Segmentation

TextFormer: A Query-based End-to-End Text Spotter with Mixed Supervision

CardiacSegFormer: Transformer for Semantic Segmentation of Cardiac Images

Character Queries: A Transformer-based Approach to On-Line Handwritten Character Segmentation

ASFormer: Transformer for Action Segmentation

GLaLT: Global-Local Attention-Augmented Light Transformer for Scene Text Recognition

Local Transformer Network on 3D Point Cloud Semantic Segmentation

Masked-attention Mask Transformer for Universal Image Segmentation

DPText-DETR: Towards Better Scene Text Detection with Dynamic Points in Transformer

Aggregated Text Transformer for Scene Text Detection

SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers

ScribFormer: Transformer Makes CNN Work Better for Scribble-based Medical Image Segmentation

ACTNet: A Dual-Attention Adapter with a CNN-Transformer Network for the Semantic Segmentation of Remote Sensing Imagery

Locality-Enhanced Transformer for Semantic Segmentation of High-Resolution Remote Sensing Images

Hybrid Attention Fusion Embedded in Transformer for Remote Sensing Image Semantic Segmentation

Transforming Scene Text Detection and Recognition: A Multi-Scale End-to-End Approach With Transformer Framework