Abstract: Given the prominence of 3-D sensors in recent years, 3-D point clouds merit further investigation for environment perception and scene understanding. Learning accurate local and global contexts in point clouds is pivotal for semantic segmentation. Neighbor aggregation (NA) and transformers have achieved notable success in local and global perception, respectively, in point cloud analysis. Nevertheless, studying each in isolation falls short of comprehensive feature learning. To address this, we take a novel step toward investigating and integrating the structures of NA and transformers. In this article, we introduce Point Neighbor Aggregation with Transformer (PointNAT), a conceptually straightforward and effective approach for improving 3-D point cloud semantic segmentation. PointNAT consists of an NA block (NAB) for local perception, a point transformer block (PTB) for global modeling, and a hybrid block that connects NABs and PTBs. NABs learn complex local features at varying scales through an improved NA operation and a multihead mechanism. PTBs perform global attention efficiently using a small set of learnable key points. Hybrid blocks serve as high- and low-frequency signal hybridizers, merging the strengths of the two blocks by adaptively assigning hybrid weights to local and global contexts. We evaluate PointNAT against state-of-the-art networks on several benchmarks, including Stanford Large-Scale 3-D Indoor Spaces (S3DIS), Toronto3D, and SensatUrban. PointNAT achieves mean intersection over union (mIoU) scores of 77.8%, 84.7%, and 65.2% on these three datasets, respectively. Furthermore, it outperforms the baseline approach PointNeXt by 3.0%, 1.3%, and 4.2%, respectively, while using only 59.9% of the parameters and 15.2% of the floating-point operations (FLOPs). The results demonstrate PointNAT's superior ability to accurately segment large-scale 3-D point cloud scenes, emphasizing its potential to advance environment perception and scene understanding. Our code is available at https://github.com/zeng-ziyin/PointNAT.
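The mIoU metric reported above can be illustrated with a minimal sketch. This is not the authors' evaluation code: the function name `mean_iou`, the per-point label lists, and the choice to skip classes that appear in neither the prediction nor the ground truth are assumptions made for illustration. Benchmark protocols typically accumulate counts over an entire test set before averaging.

```python
from typing import Sequence


def mean_iou(pred: Sequence[int], target: Sequence[int], num_classes: int) -> float:
    """Mean intersection over union for per-point class labels.

    Hypothetical helper for illustration; classes absent from both
    pred and target are skipped (an assumption, not the paper's protocol).
    """
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")

    intersection = [0] * num_classes  # points where pred == target == c
    pred_count = [0] * num_classes    # points predicted as class c
    target_count = [0] * num_classes  # points labeled as class c

    for p, t in zip(pred, target):
        pred_count[p] += 1
        target_count[t] += 1
        if p == t:
            intersection[p] += 1

    ious = []
    for c in range(num_classes):
        union = pred_count[c] + target_count[c] - intersection[c]
        if union > 0:  # skip classes that never occur
            ious.append(intersection[c] / union)

    return sum(ious) / len(ious) if ious else 0.0


if __name__ == "__main__":
    # Five points, three classes; one point of class 1 is mislabeled as class 0.
    print(mean_iou([0, 0, 1, 1, 2], [0, 1, 1, 1, 2], num_classes=3))
```

In this toy case the per-class IoUs are 1/2, 2/3, and 1, so the mean is 13/18.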

Regional-to-Local Point-Voxel Transformer for Large-Scale Indoor 3D Point Cloud Semantic Segmentation

PVT-SSD: Single-Stage 3D Object Detector with Point-Voxel Transformer

PVT: Point-Voxel Transformer for Point Cloud Learning

3D Object Segmentation Using Cross-Window Point Transformer with Latent Semantic Boundary Guidance

PVT: Point-Voxel Transformer for 3D Deep Learning

3D Semantic Segmentation Using Deep Learning for Large-Scale Indoor Point Cloud

PointMS: Semantic Segmentation for Point Cloud Based on Multi-scale Directional Convolution

Local Transformer Network on 3D Point Cloud Semantic Segmentation

VTPNet for 3D deep learning on point cloud

Associate Semantic-Instance Segmentation of 3D Point Clouds Based on Local Feature Extraction

DSVT: Dynamic Sparse Voxel Transformer with Rotated Sets

MsVFE and V-SIAM: Attention-based multi-scale feature interaction and fusion for outdoor LiDAR semantic segmentation

ESP-PCT: Enhanced VR Semantic Performance through Efficient Compression of Temporal and Spatial Redundancies in Point Cloud Transformers

Stratified Transformer for 3D Point Cloud Segmentation

PVTransformer: Point-to-Voxel Transformer for Scalable 3D Object Detection

PointNAT: Large-Scale Point Cloud Semantic Segmentation via Neighbor Aggregation With Transformer

SVT-Net: Super Light-Weight Sparse Voxel Transformer for Large Scale Place Recognition

OctFormer: Octree-based Transformers for 3D Point Clouds

PV-SSD: A Multi-Modal Point Cloud Feature Fusion Method for Projection Features and Variable Receptive Field Voxel Features

Learning Spatial and Temporal Variations for 4D Point Cloud Segmentation