Abstract: Transformers have been widely applied in image processing tasks as a substitute for convolutional neural networks (CNNs) for feature extraction, owing to their superiority in global context modeling and flexibility in model generalization. However, existing transformer-based methods for semantic segmentation of remote sensing (RS) images still have several limitations, which fall into two main aspects: 1) the transformer encoder is generally combined with a CNN-based decoder, leading to inconsistency in feature representations; and 2) the strategies for utilizing global and local context information are not sufficiently effective. Therefore, in this article, a global-local transformer segmentor (GLOTS) framework is proposed for the semantic segmentation of RS images. It acquires consistent feature representations by adopting transformers for both encoding and decoding: a masked image modeling (MIM) pretrained transformer encoder learns semantic-rich representations of input images, and a multiscale global-local transformer decoder is designed to fully exploit the global and local features. Specifically, the transformer decoder uses a feature separation-aggregation module (FSAM) to make adequate use of features at different scales and adopts a global-local attention module (GLAM), containing a global attention block (GAB) and a local attention block (LAB), to capture the global and local context information, respectively. Furthermore, a learnable progressive upsampling strategy (LPUS) is proposed to restore the resolution progressively, which can flexibly recover fine-grained details in the upsampling process. Experimental results on three benchmark RS datasets demonstrate that the proposed GLOTS achieves better performance than some state-of-the-art methods, and the superiority of the proposed framework is also verified by ablation studies. The code will be available at https://github.com/lyhnsn/GLOTS.
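
The distinction between global and local attention drawn in the abstract can be illustrated with a minimal sketch. It is not the GLOTS implementation: the abstract does not specify how GAB and LAB are built or combined. The sketch assumes a 1-D token sequence, identity query/key/value projections, non-overlapping windows for the local branch, and an element-wise sum to fuse the two branches. All function names (`attend`, `global_attention`, `local_attention`, `global_local`) are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    dim = len(queries[0])
    out = []
    for q in queries:
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys
        ]
        weights = softmax(scores)
        out.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return out


def global_attention(tokens):
    """Every token attends to every token in the sequence."""
    return attend(tokens, tokens, tokens)


def local_attention(tokens, window):
    """Each token attends only to tokens in its own non-overlapping window."""
    out = []
    for start in range(0, len(tokens), window):
        chunk = tokens[start:start + window]
        out.extend(attend(chunk, chunk, chunk))
    return out


def global_local(tokens, window):
    """Assumed fusion: element-wise sum of the global and local branches."""
    g = global_attention(tokens)
    loc = local_attention(tokens, window)
    return [[a + b for a, b in zip(gv, lv)] for gv, lv in zip(g, loc)]


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
fused = global_local(tokens, window=2)
```

In this sketch, changing a token outside a window leaves that window's local output untouched but changes the global output, which is the complementary behavior the two branches are meant to provide.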

DSViT: Dynamically Scalable Vision Transformer for Remote Sensing Image Segmentation and Classification

SemiCVT: Semi-Supervised Convolutional Vision Transformer for Semantic Segmentation

SLViT: Scale-Wise Language-Guided Vision Transformer for Referring Image Segmentation

Rethinking Transformers for Semantic Segmentation of Remote Sensing Images

SDPT: Semantic-Aware Dimension-Pooling Transformer for Image Segmentation

A Bio-Inspired Visual Perception Transformer for Cross-Domain Semantic Segmentation of High-Resolution Remote Sensing Images

Efficient Transformer for Remote Sensing Image Segmentation

Vision Transformers for Remote Sensing Image Classification

Multi-Scale Sparse Transformer for Remote Sensing Scene Classification

DilateFormer: Multi-Scale Dilated Transformer for Visual Recognition

Advancing Plain Vision Transformer Towards Remote Sensing Foundation Model

Hybrid Attention Fusion Embedded in Transformer for Remote Sensing Image Semantic Segmentation

DAT++: Spatially Dynamic Vision Transformer with Deformable Attention

Vision Transformer with Sparse Scan Prior

Automated classification of remote sensing satellite images using deep learning based vision transformer

Human Body Segmentation in Wide-Angle Images Based on Fast Vision Transformers

Triple Attention Vision Transformers for Remote Sensing Image Classification

Transformer and CNN Hybrid Deep Neural Network for Semantic Segmentation of Very-High-Resolution Remote Sensing Imagery

Dynamic multi-headed self-attention and multiscale enhancement vision transformer for object detection

A transformer-based approach empowered by a self-attention technique for semantic segmentation in remote sensing