Abstract: Transformers have been widely applied in image processing tasks as a substitute for convolutional neural networks (CNNs) for feature extraction, owing to their superiority in global context modeling and flexibility in model generalization. However, existing transformer-based methods for semantic segmentation of remote sensing (RS) images still have several limitations, which fall into two main aspects: 1) the transformer encoder is generally combined with a CNN-based decoder, leading to inconsistency in feature representations; and 2) the strategies for utilizing global and local context information are not sufficiently effective. Therefore, in this article, a global-local transformer segmentor (GLOTS) framework is proposed for the semantic segmentation of RS images. It acquires consistent feature representations by adopting transformers for both encoding and decoding: a masked image modeling (MIM) pretrained transformer encoder learns semantic-rich representations of input images, and a multiscale global-local transformer decoder is designed to fully exploit the global and local features. Specifically, the transformer decoder uses a feature separation-aggregation module (FSAM) to make adequate use of features at different scales and adopts a global-local attention module (GLAM), containing a global attention block (GAB) and a local attention block (LAB), to capture the global and local context information, respectively. Furthermore, a learnable progressive upsampling strategy (LPUS) is proposed to restore the resolution progressively, which can flexibly recover fine-grained details in the upsampling process. Experimental results on three benchmark RS datasets demonstrate that the proposed GLOTS achieves better performance than some state-of-the-art methods, and the superiority of the proposed framework is also verified by ablation studies. The code will be available at https://github.com/lyhnsn/GLOTS.
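
The difference between global and local attention mentioned in the abstract can be illustrated with a minimal, dependency-free sketch. This is not the GLAM, GAB, or LAB design from the paper: it assumes a 1-D token sequence, a single head, no learned query/key/value projections, and a simple element-wise sum to fuse the two branches. The names `attend`, `global_local`, and `window` are hypothetical and chosen only for this illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(tokens, window=None):
    """Single-head dot-product self-attention without learned projections.

    window=None : every token attends to all tokens (global context).
    window=w    : token i attends only to tokens j with |i - j| <= w
                  (local context).
    """
    n = len(tokens)
    dim = len(tokens[0])
    out = []
    for i, query in enumerate(tokens):
        if window is None:
            idx = list(range(n))
        else:
            idx = [j for j in range(n) if abs(i - j) <= window]
        # Scaled dot-product scores against the permitted keys only.
        scores = [
            sum(a * b for a, b in zip(query, tokens[j])) / math.sqrt(dim)
            for j in idx
        ]
        weights = softmax(scores)
        # Weighted average of the permitted value vectors.
        out.append(
            [sum(w * tokens[j][d] for w, j in zip(weights, idx)) for d in range(dim)]
        )
    return out


def global_local(tokens, window=1):
    """Assumed fusion: element-wise sum of the global and local branches."""
    g = attend(tokens)
    l = attend(tokens, window)
    return [[a + b for a, b in zip(g_row, l_row)] for g_row, l_row in zip(g, l)]


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
fused = global_local(tokens, window=1)
```

With `window=0` each token attends only to itself, so `attend` returns the input unchanged. With a window at least as large as the sequence, the local branch coincides with the global one. A real decoder would operate on 2-D feature maps with learned projections and multiple heads.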

Cross-scale sampling transformer for semantic image segmentation

ScaleFormer: Revisiting the Transformer-based Backbones from a Scale-wise Perspective for Medical Image Segmentation

Feature Selective Transformer for Semantic Image Segmentation

Rethinking Transformers for Semantic Segmentation of Remote Sensing Images

Full-Scale Selective Transformer for Semantic Segmentation

Multi-Scale Transformer with Explicit Boundary Constraint for Semantic Segmentation

Transformer Scale Gate for Semantic Segmentation

SSDT: Scale-Separation Semantic Decoupled Transformer for Semantic Segmentation of Remote Sensing Images

Enhanced Multi-Scale Feature Adaptive Fusion Sparse Convolutional Network for Large-Scale Scenes Semantic Segmentation

A Dynamic Cross-Scale Transformer with Dual-Compound Representation for 3D Medical Image Segmentation

Cross-scale Vision Transformer for crowd localization

A Crossmodal Multiscale Fusion Network for Semantic Segmentation of Remote Sensing Data

Semantic segmentation using cross-stage feature reweighting and efficient self-attention

Cross-Scale Feature Propagation Network for Semantic Segmentation of High-Resolution Remote Sensing Images

Based on cross-scale fusion attention mechanism network for semantic segmentation for street scenes

CNN and Transformer Fusion for Remote Sensing Image Semantic Segmentation

Swin Transformer with Multi-Scale Residual Attention for Semantic Segmentation of Remote Sensing Images

Remote Sensing Image Semantic Segmentation Network Based on Multi-Scale Feature Enhancement Fusion

Multi-Scale Feature Aggregation by Cross-Scale Pixel-to-Region Relation Operation for Semantic Segmentation

TCNet: Multiscale Fusion of Transformer and CNN for Semantic Segmentation of Remote Sensing Images

Semi-Supervised Adversarial Semantic Segmentation Network Using Transformer and Multiscale Convolution for High-Resolution Remote Sensing Imagery