Abstract: Transformers have been widely applied in image processing tasks as a substitute for convolutional neural networks (CNNs) for feature extraction, owing to their strength in global context modeling and their flexibility in model generalization. However, existing transformer-based methods for semantic segmentation of remote sensing (RS) images still have several limitations, which fall into two main categories: 1) the transformer encoder is generally combined with a CNN-based decoder, leading to inconsistent feature representations; and 2) the strategies for exploiting global and local context information are not sufficiently effective. Therefore, this article proposes a global-local transformer segmentor (GLOTS) framework for the semantic segmentation of RS images, which acquires consistent feature representations by adopting transformers for both encoding and decoding. A masked image modeling (MIM) pretrained transformer encoder learns semantic-rich representations of the input images, and a multiscale global-local transformer decoder is designed to fully exploit the global and local features. Specifically, the transformer decoder uses a feature separation-aggregation module (FSAM) to make adequate use of features at different scales and adopts a global-local attention module (GLAM), containing a global attention block (GAB) and a local attention block (LAB), to capture global and local context information, respectively. Furthermore, a learnable progressive upsampling strategy (LPUS) is proposed to restore the resolution progressively, which can flexibly recover fine-grained details during upsampling. Experimental results on three benchmark RS datasets demonstrate that the proposed GLOTS achieves better performance than some state-of-the-art methods, and the superiority of the proposed framework is also verified by ablation studies. The code will be available at https://github.com/lyhnsn/GLOTS.
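
The contrast between a global attention block and a local attention block can be illustrated with a small sketch. This is not the GLOTS implementation: the function names, the identity query/key/value projections, the non-overlapping 1-D windows, and the fusion of the two branches by summation are all assumptions made for illustration, since the abstract does not specify them.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(tokens):
    """Single-head self-attention over `tokens` (a list of equal-length vectors).

    Assumption: queries, keys, and values are the tokens themselves
    (identity projections), which keeps the sketch free of learned weights.
    """
    dim = len(tokens[0])
    output = []
    for query in tokens:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in tokens
        ]
        weights = softmax(scores)
        output.append(
            [sum(w * value[j] for w, value in zip(weights, tokens))
             for j in range(dim)]
        )
    return output


def global_attention(tokens):
    """Every token attends to every other token (global context)."""
    return attend(tokens)


def local_attention(tokens, window):
    """Tokens attend only within non-overlapping windows (local context)."""
    output = []
    for start in range(0, len(tokens), window):
        output.extend(attend(tokens[start:start + window]))
    return output


def global_local_attention(tokens, window):
    """Hypothetical fusion: element-wise sum of the two branches."""
    global_out = global_attention(tokens)
    local_out = local_attention(tokens, window)
    return [
        [g + l for g, l in zip(g_row, l_row)]
        for g_row, l_row in zip(global_out, local_out)
    ]


if __name__ == "__main__":
    sample = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
    for row in global_local_attention(sample, window=2):
        print(row)
```

In this sketch, changing a token outside a window leaves the local branch's output for that window untouched, while the global branch's output changes for every position. That difference is the one the two blocks are meant to capture.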

ELiFormer: A Hierarchical Transformer Based Model with Efficient Encoder and Lightweight Decoder for Semantic Segmentation

SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers

HAFormer: Unleashing the Power of Hierarchy-Aware Features for Lightweight Semantic Segmentation

Lightweight Convolutional Neural Networks with Context Broadcast Transformer for Real-Time Semantic Segmentation

A Transformer-Based Decoder for Semantic Segmentation with Multi-level Context Mining

FeedFormer: Revisiting Transformer Decoder for Efficient Semantic Segmentation

Enhancing Mask Transformer with Auxiliary Convolution Layers for Semantic Segmentation

Head-Free Lightweight Semantic Segmentation with Linear Transformer

Lightweight Real-time Semantic Segmentation Network with Efficient Transformer and CNN

MeshFormer: High-resolution Mesh Segmentation with Graph Transformer

Rethinking Transformers for Semantic Segmentation of Remote Sensing Images

Semantic segmentation using cross-stage feature reweighting and efficient self-attention

Lightweight Transformer Traffic Scene Semantic Segmentation Algorithm Integrating Multi-Scale Depth Convolution

LACTNet: A Lightweight Real-Time Semantic Segmentation Network Based on an Aggregated Convolutional Neural Network and Transformer

Dual-resolution Transformer Combined with Multi-Layer Separable Convolution Fusion Network for Real-Time Semantic Segmentation

TBFormer: three-branch efficient transformer for semantic segmentation

Locality-Enhanced Transformer for Semantic Segmentation of High-Resolution Remote Sensing Images

MISSFormer: An Effective Medical Image Segmentation Transformer

Embedding-Free Transformer with Inference Spatial Reduction for Efficient Semantic Segmentation

HD-Former: A hierarchical dependency Transformer for medical image segmentation

LACTNet: A Lightweight Real-time Semantic Segmentation Network Based on Aggregation CNN and Transformer