Abstract: Transformers have been widely applied in image processing tasks as a substitute for convolutional neural networks (CNNs) for feature extraction, owing to their strength in global context modeling and their flexibility in model generalization. However, existing transformer-based methods for semantic segmentation of remote sensing (RS) images still have several limitations, which fall into two main aspects: 1) the transformer encoder is generally combined with a CNN-based decoder, leading to inconsistent feature representations; and 2) the strategies for utilizing global and local context information are not sufficiently effective. Therefore, in this article, a global-local transformer segmentor (GLOTS) framework is proposed for the semantic segmentation of RS images. It acquires consistent feature representations by adopting transformers for both encoding and decoding: a masked image modeling (MIM) pretrained transformer encoder learns semantic-rich representations of input images, and a multiscale global-local transformer decoder is designed to fully exploit the global and local features. Specifically, the transformer decoder uses a feature separation-aggregation module (FSAM) to make adequate use of features at different scales and adopts a global-local attention module (GLAM), containing a global attention block (GAB) and a local attention block (LAB), to capture global and local context information, respectively. Furthermore, a learnable progressive upsampling strategy (LPUS) is proposed to restore the resolution progressively, which can flexibly recover fine-grained details during upsampling. Experimental results on three benchmark RS datasets demonstrate that the proposed GLOTS achieves better performance than some state-of-the-art methods, and the superiority of the proposed framework is also verified by ablation studies. The code will be available at https://github.com/lyhnsn/GLOTS.
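
The distinction between global and local attention drawn in the abstract can be illustrated with a minimal sketch. This is not the GLOTS implementation: it assumes a 1D token sequence, identity query/key/value projections (no learned weights), non-overlapping windows for the local branch, and an element-wise sum to fuse the two branches. All function names here are hypothetical and chosen for illustration only.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Scaled dot-product attention for one query over a set of keys/values."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]


def global_attention(tokens):
    """Global branch: every token attends to all tokens."""
    return [attend(t, tokens, tokens) for t in tokens]


def local_attention(tokens, window):
    """Local branch: tokens attend only within non-overlapping windows."""
    out = []
    for start in range(0, len(tokens), window):
        block = tokens[start:start + window]
        out.extend(attend(t, block, block) for t in block)
    return out


def global_local_attention(tokens, window):
    """Assumed fusion: element-wise sum of the global and local outputs."""
    g = global_attention(tokens)
    l = local_attention(tokens, window)
    return [[a + b for a, b in zip(gv, lv)] for gv, lv in zip(g, l)]


# Four 2-dimensional tokens; the local branch sees windows of two tokens.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
fused = global_local_attention(tokens, window=2)
```

When the window covers the whole sequence, the local branch reduces to the global one; shrinking the window restricts each token to its nearest neighbours, which is the trade-off the two blocks are meant to balance.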

SEGT: A General Spatial Expansion Group Transformer for nuScenes Lidar-based Object Detection Task

SEFormer: Structure Embedding Transformer for 3D Object Detection

DOCTR: Disentangled Object-Centric Transformer for Point Scene Understanding

Anchor-Based Transformer for Temporal LiDAR 3D Object Detection

PVT-SSD: Single-Stage 3D Object Detector with Point-Voxel Transformer

Rethinking Transformers for Semantic Segmentation of Remote Sensing Images

LEST: Large-scale LiDAR Semantic Segmentation with Transformer

MsSVT: Mixed-scale Sparse Voxel Transformer for 3D Object Detection on Point Clouds

Position-Guided Point Cloud Panoptic Segmentation Transformer

SEED: A Simple and Effective 3D DETR in Point Clouds

D2T-Net: A dual-domain transformer network exploiting spatial and channel dimensions for semantic segmentation of urban mobile laser scanning point clouds

Long-short Range Adaptive Transformer with Dynamic Sampling for 3D Object Detection

Multi-Scale Geometric Feature Extraction and Global Transformer for Real-World Indoor Point Cloud Analysis

GaussTR: Foundation Model-Aligned Gaussian Transformer for Self-Supervised 3D Spatial Understanding

Point Cloud Semantic Segmentation with Adaptive Spatial Structure Graph Transformer

GSTran: Joint Geometric and Semantic Coherence for Point Cloud Segmentation

Spatial Transformer for 3D Point Clouds

Towards Grouping in Large Scenes with Occlusion-aware Spatio-temporal Transformers

Boosting Lidar 3D Object Detection with Point Cloud Semantic Segmentation

Graph Neural Network and Spatiotemporal Transformer Attention for 3D Video Object Detection from Point Clouds

SGIFormer: Semantic-guided and Geometric-enhanced Interleaving Transformer for 3D Instance Segmentation