Abstract: Transformers have been widely applied in image processing tasks as a substitute for convolutional neural networks (CNNs) for feature extraction, owing to their superiority in global context modeling and flexibility in model generalization. However, existing transformer-based methods for semantic segmentation of remote sensing (RS) images still have several limitations, which fall into two main aspects: 1) the transformer encoder is generally combined with a CNN-based decoder, leading to inconsistency in feature representations; and 2) the strategies for utilizing global and local context information are not sufficiently effective. Therefore, in this article, a global-local transformer segmentor (GLOTS) framework is proposed for the semantic segmentation of RS images. It acquires consistent feature representations by adopting transformers for both encoding and decoding: a masked image modeling (MIM) pretrained transformer encoder learns semantic-rich representations of input images, and a multiscale global-local transformer decoder is designed to fully exploit the global and local features. Specifically, the transformer decoder uses a feature separation-aggregation module (FSAM) to make adequate use of features at different scales and adopts a global-local attention module (GLAM), containing a global attention block (GAB) and a local attention block (LAB), to capture the global and local context information, respectively. Furthermore, a learnable progressive upsampling strategy (LPUS) is proposed to restore the resolution progressively, which can flexibly recover fine-grained details in the upsampling process. Experimental results on three benchmark RS datasets demonstrate that the proposed GLOTS achieves better performance than some state-of-the-art methods, and the superiority of the proposed framework is also verified by ablation studies. The code will be available at https://github.com/lyhnsn/GLOTS.
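
To make the distinction between global and local attention concrete, the following is a minimal NumPy sketch and not the GLOTS implementation. It rests on several assumptions that are not stated in the abstract: single-head attention with identity query/key/value projections, non-overlapping square windows for the local branch, and fusion of the two branches by a residual sum. The function names (`global_attention`, `local_attention`, `global_local_block`) are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(tokens):
    # Single-head self-attention over the second-to-last axis.
    # Assumption: identity projections (queries = keys = values = tokens).
    d = tokens.shape[-1]
    scores = tokens @ np.swapaxes(tokens, -1, -2) / np.sqrt(d)
    return softmax(scores, axis=-1) @ tokens


def global_attention(feat):
    # feat: (H, W, C). Every position attends to every other position.
    h, w, c = feat.shape
    return attention(feat.reshape(h * w, c)).reshape(h, w, c)


def local_attention(feat, window):
    # Attention restricted to non-overlapping window x window patches.
    h, w, c = feat.shape
    assert h % window == 0 and w % window == 0, "H and W must be divisible by window"
    nh, nw = h // window, w // window
    # (H, W, C) -> (nh, nw, window * window, C): one token set per patch.
    x = feat.reshape(nh, window, nw, window, c).transpose(0, 2, 1, 3, 4)
    x = attention(x.reshape(nh, nw, window * window, c))
    # Undo the patch partition to recover the (H, W, C) layout.
    x = x.reshape(nh, nw, window, window, c).transpose(0, 2, 1, 3, 4)
    return x.reshape(h, w, c)


def global_local_block(feat, window=4):
    # Assumption: the two branches are fused by a residual sum.
    return feat + global_attention(feat) + local_attention(feat, window)
```

In this sketch the global branch costs on the order of (HW)^2 score entries, while the local branch computes only window^2 scores per token, which is why the two are often paired: one supplies long-range context and the other supplies cheap neighborhood detail.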

Spatial-specific Transformer with Involution for Semantic Segmentation of High-Resolution Remote Sensing Images

BiTSRS: A Bi-Decoder Transformer Segmentor for High-Spatial-Resolution Remote Sensing Images

Rethinking Transformers for Semantic Segmentation of Remote Sensing Images.

Efficient Transformer for Remote Sensing Image Segmentation

Semantic Segmentation of High-Resolution Remote Sensing Images Using an Improved Transformer.

Transformer and CNN Hybrid Deep Neural Network for Semantic Segmentation of Very-High-Resolution Remote Sensing Imagery

Swin Transformer with Multi-Scale Residual Attention for Semantic Segmentation of Remote Sensing Images.

ResU-Former: Advancing Remote Sensing Image Segmentation with Swin Residual Transformer for Precise Global–Local Feature Recognition and Visual–Semantic Space Learning

Enhancing Multiscale Representations with Transformer for Remote Sensing Image Semantic Segmentation

Cascaded CNN and global–local attention transformer network-based semantic segmentation for high-resolution remote sensing image

Local-enhanced multi-scale aggregation swin transformer for semantic segmentation of high-resolution remote sensing images

Locality-Enhanced Transformer for Semantic Segmentation of High-Resolution Remote Sensing Images.

Hybrid Attention Fusion Embedded in Transformer for Remote Sensing Image Semantic Segmentation

Class-Guided Swin Transformer for Semantic Segmentation of Remote Sensing Imagery

A Semantic Segmentation Method for Remote Sensing Images Based on the Swin Transformer Fusion Gabor Filter

Swin Transformer Embedding UNet for Remote Sensing Image Semantic Segmentation.

Remote sensing image instance segmentation network with transformer and multi-scale feature representation

Adaptive enhanced swin transformer with U-net for remote sensing image segmentation

STransFuse: Fusing Swin Transformer and Convolutional Neural Network for Remote Sensing Image Semantic Segmentation

Multiscale Feature Learning by Transformer for Building Extraction From Satellite Images

Integrating Spatial Details with Long-Range Contexts for Semantic Segmentation of Very High-Resolution Remote-Sensing Images.