Abstract: Large-scale LiDAR-based point cloud semantic segmentation is a critical task in autonomous driving perception. Almost all of the previous state-of-the-art LiDAR semantic segmentation methods are variants of sparse 3D convolution. Although the Transformer architecture is becoming popular in the field of natural language processing and 2D computer vision, its application to large-scale point cloud semantic segmentation is still limited. In this paper, we propose a LiDAR sEmantic Segmentation architecture with pure Transformer, LEST. LEST comprises two novel components: a Space Filling Curve (SFC) Grouping strategy and a Distance-based Cosine Linear Transformer, DISCO. On the public nuScenes semantic segmentation validation set and SemanticKITTI test set, our model outperforms all the other state-of-the-art methods.

What problem does this paper attempt to address?

This paper addresses large-scale LiDAR point cloud semantic segmentation. Specifically, it focuses on how to use the Transformer architecture effectively to process large-scale point cloud data in autonomous driving perception systems and achieve high-accuracy semantic segmentation.

### Background and Problem

In an autonomous driving system, LiDAR-based 3D environmental perception is crucial for safe and reliable driving. Unlike image-based 2D perception tasks, large-scale point cloud data is irregular, sparse, and unordered, which makes 3D perception more challenging. Semantic segmentation is especially demanding because it requires finer-grained spatial information than other 3D perception tasks.

### Limitations of Existing Methods

1. **Traditional Methods**: Early methods such as PointNet aggregate the features of local unordered points through max-pooling, but this approach is inefficient on large-scale point clouds.
2. **3D Convolution Methods**: Although sparse 3D convolution performs well in 3D object detection, its performance on large-scale point cloud semantic segmentation is limited by the cubic complexity of the convolution kernel and its limited receptive field.
3. **Transformer Application**: Although the Transformer has achieved great success in natural language processing (NLP) and 2D computer vision, its application to large-scale point cloud semantic segmentation is still limited. The main reason is the sheer size of the point cloud: applying a standard Transformer directly leads to high computational complexity.

### Main Contributions of the Paper

1. **Proposing the LEST Architecture**: The authors propose LEST, a pure Transformer architecture for large-scale LiDAR point cloud semantic segmentation.
2. **SFC Grouping Strategy**: A grouping strategy based on a Space Filling Curve (SFC) groups the point cloud efficiently, and a standard Transformer is applied within each group to aggregate local features. The strategy keeps the number of points in each group almost the same, thereby reducing computational complexity.
3. **DISCO Module**: A new linear Transformer, the Distance-based Cosine Linear Transformer (DISCO), constructs a global receptive field with linear complexity. DISCO overcomes the limitations of the traditional dot product and cosine similarity by using the 1-norm distance between vectors as a similarity measure.

### Experimental Results

On two large-scale LiDAR semantic segmentation datasets, nuScenes (validation set) and SemanticKITTI (test set), LEST outperforms the existing state-of-the-art methods. The experimental results show that LEST improves computational efficiency and also significantly improves segmentation accuracy in multiple categories.

### Summary

By introducing the SFC Grouping strategy and the DISCO module, the paper applies a pure Transformer to large-scale LiDAR point cloud semantic segmentation and addresses the efficiency and accuracy problems that existing methods face on large-scale point cloud data.
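To make the SFC Grouping idea more concrete, here is a minimal sketch of grouping points along a space-filling curve. It is an illustration under stated assumptions, not the authors' implementation: it uses a Morton (Z-order) curve, which may differ from the curve used in the paper, and the names `morton_codes`, `sfc_groups`, `voxel_size`, `bits`, and `group_size` are hypothetical. The sketch assumes a non-empty `(N, 3)` array of point coordinates.

```python
import numpy as np


def morton_codes(points, voxel_size=0.1, bits=10):
    """Return one Z-order (Morton) key per point by interleaving voxel-index bits.

    Assumption: a Morton curve stands in for whatever space-filling curve
    the paper actually uses.
    """
    # Quantize coordinates to non-negative integer voxel indices.
    grid = np.floor((points - points.min(axis=0)) / voxel_size).astype(np.int64)
    # Clamp so every index fits in `bits` bits (3 * bits <= 63 for int64).
    grid = np.clip(grid, 0, (1 << bits) - 1)

    codes = np.zeros(len(points), dtype=np.int64)
    for b in range(bits):
        for axis in range(3):
            # Bit b of axis `axis` goes to position 3*b + axis of the key.
            codes |= ((grid[:, axis] >> b) & 1) << (3 * b + axis)
    return codes


def sfc_groups(points, group_size, voxel_size=0.1):
    """Split point indices into nearly equal-sized groups along the curve."""
    # Sorting by curve key places spatially close points near each other.
    order = np.argsort(morton_codes(points, voxel_size), kind="stable")
    # Ceil division: the fewest groups such that no group exceeds group_size.
    n_groups = -(-len(points) // group_size)
    # array_split yields group sizes that differ by at most one point.
    return np.array_split(order, n_groups)


# Usage: 1000 random points in a 100 m cube, at most 64 points per group.
rng = np.random.default_rng(0)
pts = rng.uniform(-50.0, 50.0, size=(1000, 3))
groups = sfc_groups(pts, group_size=64)
```

Each entry of `groups` is an index array that could be used to gather per-point features before running attention inside that group. With groups of about `g` points, attention within groups costs on the order of `N * g` pairwise comparisons instead of `N * N` for attention over the whole cloud.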

LEST: Large-scale LiDAR Semantic Segmentation with Transformer

SEFormer: Structure Embedding Transformer for 3D Object Detection

A Transformer-based Real-time LiDAR Semantic Segmentation Method for Restricted Mobile Devices

Rethinking Transformers for Semantic Segmentation of Remote Sensing Images.

3D Learnable Supertoken Transformer for LiDAR Point Cloud Scene Segmentation

Locality-Enhanced Transformer for Semantic Segmentation of High-Resolution Remote Sensing Images.

SDPT: Semantic-Aware Dimension-Pooling Transformer for Image Segmentation

Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers

UNetFormer: A UNet-like transformer for efficient semantic segmentation of remote sensing urban scene imagery

LiDARFormer: A Unified Transformer-based Multi-task Network for LiDAR Perception

TCFNet: Transformer and CNN Fusion Model for LiDAR Point Cloud Semantic Segmentation.

Radial Transformer for Large-Scale Outdoor LiDAR Point Cloud Semantic Segmentation

Efficient Hybrid Transformer: Learning Global-local Context for Urban Sence Segmentation

Joint Semantic and Instance Segmentation in 3D Point Cloud Based on Transformer

PCPNet: an Efficient and Semantic-Enhanced Transformer Network for Point Cloud Prediction.

Semantic Segmentation of High-Resolution Remote Sensing Images Using an Improved Transformer.

Point Cloud Semantic Segmentation with Adaptive Spatial Structure Graph Transformer

Dual-resolution Transformer Combined with Multi-Layer Separable Convolution Fusion Network for Real-Time Semantic Segmentation

D2T-Net: A dual-domain transformer network exploiting spatial and channel dimensions for semantic segmentation of urban mobile laser scanning point clouds

Hybrid CNN-LSTM Architecture for LiDAR Point Clouds Semantic Segmentation

Stratified transformer for 3d point cloud segmentation