Abstract: Given the prominence of 3-D sensors in recent years, 3-D point clouds merit further investigation for environment perception and scene understanding. Learning accurate local and global contexts in point clouds is pivotal for semantic segmentation, and neighbor aggregation (NA) and transformers have achieved notable success in local and global perception, respectively, in point cloud analysis. Nevertheless, studying each independently is far from optimal for comprehensive feature learning. To address this, we take a novel step toward investigating and integrating the structures of NA and transformers. In this article, we introduce Point Neighbor Aggregation with Transformer (PointNAT), a conceptually straightforward and effective approach for improving 3-D point cloud semantic segmentation. PointNAT consists of an NA block (NAB) for local perception, a point transformer block (PTB) for global modeling, and a hybrid block that connects NABs and PTBs. NABs learn complex local features at varying scales through an improved NA operation and a multihead mechanism. PTBs perform global attention efficiently using a small set of learnable key points. Hybrid blocks serve as high- and low-frequency signal hybridizers, merging the strengths of the two blocks by adaptively assigning hybrid weights to local and global contexts. We evaluate PointNAT against state-of-the-art networks on several benchmarks, including Stanford Large-Scale 3-D Indoor Spaces (S3DIS), Toronto3D, and SensatUrban. PointNAT achieves mean intersection over union (mIoU) scores of 77.8%, 84.7%, and 65.2% on these three datasets, respectively. Furthermore, it outperforms the baseline approach PointNeXt by 3.0%, 1.3%, and 4.2%, respectively, while using only 59.9% of the parameters and 15.2% of the floating-point operations (FLOPs). The results demonstrate PointNAT's superior ability to accurately segment large-scale 3-D point cloud scenes, emphasizing its potential to advance environment perception and scene understanding. Our code is available at https://github.com/zeng-ziyin/PointNAT.
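
For readers unfamiliar with the metric reported above, the following is a minimal sketch of how a mean intersection over union (mIoU) score can be computed from per-point class labels. It is not taken from the PointNAT code base: the function name `mean_iou`, the toy labels, and the choice to skip classes absent from both prediction and ground truth are assumptions made for illustration, and individual benchmarks may differ in how they accumulate counts or ignore labels.

```python
from typing import Sequence


def mean_iou(pred: Sequence[int], target: Sequence[int], num_classes: int) -> float:
    """Average the per-class IoU over classes that occur in pred or target.

    Hypothetical helper for illustration; not the authors' evaluation code.
    """
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")

    intersection = [0] * num_classes  # points where prediction and label agree
    pred_count = [0] * num_classes    # points predicted as each class
    target_count = [0] * num_classes  # points labeled as each class

    for p, t in zip(pred, target):
        pred_count[p] += 1
        target_count[t] += 1
        if p == t:
            intersection[p] += 1

    ious = []
    for c in range(num_classes):
        union = pred_count[c] + target_count[c] - intersection[c]
        if union > 0:  # assumption: classes absent from both are skipped
            ious.append(intersection[c] / union)

    if not ious:
        raise ValueError("no class is present in pred or target")
    return sum(ious) / len(ious)


# Toy example with six points and three classes (made-up labels).
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
print(mean_iou(pred, target, num_classes=3))
```

In this toy case the per-class IoU values are 1/3, 2/3, and 1/2, so the mean is 0.5. A score such as 77.8% on S3DIS would correspond to the same kind of average taken over that benchmark's classes.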

RS-TNet: point cloud transformer with relation-shape awareness for fine-grained 3D visual processing

Text to Point Cloud Localization with Relation-Enhanced Transformer

EGCT: Enhanced Graph Convolutional Transformer for 3D Point Cloud Representation Learning

Learning Point Cloud Shapes with Geometric and Topological Structures

Spatial Transformer for 3D Point Clouds

Group-in-Group Relation-Based Transformer for 3D Point Cloud Learning

Learning point cloud context information based on 3D transformer for more accurate and efficient classification

Local Transformer Network on 3D Point Cloud Semantic Segmentation

3DPCTN: Two 3D Local-Object Point-Cloud-Completion Transformer Networks Based on Self-Attention and Multi-Resolution

ResSANet: Learning Geometric Information for Point Cloud Processing

3DCTN: 3D Convolution-Transformer Network for Point Cloud Classification

PointNAT: Large-Scale Point Cloud Semantic Segmentation via Neighbor Aggregation With Transformer

GTNet: Graph Transformer Network for 3D Point Cloud Classification and Semantic Segmentation

Stratified Transformer for 3D Point Cloud Segmentation

PointCAT: Cross-Attention Transformer for point cloud

Dynamic clustering transformer network for point cloud segmentation

APPT: Asymmetric Parallel Point Transformer for 3D Point Cloud Understanding

Point Transformer V3: Simpler, Faster, Stronger

TT-Net: Tensorized Transformer Network for 3D medical image segmentation

D2T-Net: A dual-domain transformer network exploiting spatial and channel dimensions for semantic segmentation of urban mobile laser scanning point clouds

Point Cloud Completion Via Skeleton-Detail Transformer