Abstract: Transformers have been widely applied in image processing tasks as a substitute for convolutional neural networks (CNNs) for feature extraction, owing to their superiority in global context modeling and flexibility in model generalization. However, existing transformer-based methods for semantic segmentation of remote sensing (RS) images still have several limitations, which fall into two main aspects: 1) the transformer encoder is generally combined with a CNN-based decoder, leading to inconsistency in feature representations; and 2) the strategies for utilizing global and local context information are not sufficiently effective. Therefore, in this article, a global-local transformer segmentor (GLOTS) framework is proposed for the semantic segmentation of RS images. It acquires consistent feature representations by adopting transformers for both encoding and decoding: a masked image modeling (MIM) pretrained transformer encoder learns semantic-rich representations of input images, and a multiscale global-local transformer decoder is designed to fully exploit the global and local features. Specifically, the transformer decoder uses a feature separation-aggregation module (FSAM) to make adequate use of features at different scales and adopts a global-local attention module (GLAM), containing a global attention block (GAB) and a local attention block (LAB), to capture the global and local context information, respectively. Furthermore, a learnable progressive upsampling strategy (LPUS) is proposed to restore the resolution progressively, which can flexibly recover fine-grained details in the upsampling process. Experimental results on three benchmark RS datasets demonstrate that the proposed GLOTS achieves better performance than some state-of-the-art methods, and the superiority of the proposed framework is also verified by ablation studies. The code will be available at https://github.com/lyhnsn/GLOTS.
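
To make the distinction between global and local attention concrete, the following is a minimal, illustrative sketch and not the GLOTS implementation. It assumes single-head self-attention with identity projections (Q = K = V), a 1-D token sequence, non-overlapping windows for the local branch, and fusion of the two branches by element-wise addition. None of these choices, nor the function names, come from the abstract; they are hypothetical and chosen only to keep the example short.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(tokens):
    """Scaled dot-product self-attention with Q = K = V = tokens (assumption)."""
    dim = len(tokens[0])
    out = []
    for q in tokens:
        # Similarity of this query to every key in the given token set.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in tokens]
        weights = softmax(scores)
        # Weighted average of the values.
        out.append([sum(w * v[d] for w, v in zip(weights, tokens)) for d in range(dim)])
    return out


def global_attention(tokens):
    """Every token attends to all tokens (global context)."""
    return attention(tokens)


def local_attention(tokens, window):
    """Each token attends only to tokens in its own non-overlapping window."""
    out = []
    for start in range(0, len(tokens), window):
        out.extend(attention(tokens[start:start + window]))
    return out


def global_local(tokens, window):
    """Fuse both branches by element-wise sum (hypothetical fusion rule)."""
    g = global_attention(tokens)
    l = local_attention(tokens, window)
    return [[a + b for a, b in zip(gv, lv)] for gv, lv in zip(g, l)]


if __name__ == "__main__":
    tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
    for row in global_local(tokens, window=2):
        print(row)
```

In this sketch the window size controls how much context the local branch sees: a window of one token leaves each token unchanged, while a window covering the whole sequence makes the local branch identical to the global one.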

A Novel Shape Guided Transformer Network for Instance Segmentation in Remote Sensing Images

Rethinking Transformers for Semantic Segmentation of Remote Sensing Images

SLViT: Scale-Wise Language-Guided Vision Transformer for Referring Image Segmentation

BiTSRS: A Bi-Decoder Transformer Segmentor for High-Spatial-Resolution Remote Sensing Images

Remote sensing image instance segmentation network with transformer and multi-scale feature representation

SDPT: Semantic-Aware Dimension-Pooling Transformer for Image Segmentation

PSSD-Transformer: Powerful Sparse Spike-Driven Transformer for Image Semantic Segmentation

Efficient Transformer for Remote Sensing Image Segmentation

Adaptive enhanced swin transformer with U-net for remote sensing image segmentation

An Improved Swin Transformer-Based Model for Remote Sensing Object Detection and Instance Segmentation

Swin Transformer Embedding UNet for Remote Sensing Image Semantic Segmentation

A transformer-based approach empowered by a self-attention technique for semantic segmentation in remote sensing

Spatial-specific Transformer with Involution for Semantic Segmentation of High-Resolution Remote Sensing Images

Class-Guided Swin Transformer for Semantic Segmentation of Remote Sensing Imagery

Enhancing Multiscale Representations with Transformer for Remote Sensing Image Semantic Segmentation

A Stage-Adaptive Selective Network with Position Awareness for Semantic Segmentation of LULC Remote Sensing Images

SSNet: A Novel Transformer and CNN Hybrid Network for Remote Sensing Semantic Segmentation

Accurate Instance Segmentation for Remote Sensing Images via Adaptive and Dynamic Feature Learning

Transformer and CNN Hybrid Deep Neural Network for Semantic Segmentation of Very-High-Resolution Remote Sensing Imagery

Enhancing Efficient Global Understanding Network with CSWin Transformer for Urban Scene Images Segmentation