An efficient point cloud semantic segmentation network with multiscale super-patch transformer

Yongwei Miao,Yuliang Sun,Yimin Zhang,Jinrong Wang,Xudong Zhang
DOI: https://doi.org/10.1038/s41598-024-63451-8
IF: 4.6
2024-06-27
Scientific Reports
Abstract: Efficient semantic segmentation of large-scale point cloud scenes is a fundamental and essential task for perceiving or understanding the surrounding 3D environment. However, due to the vast amount of point cloud data, it is challenging to train deep neural networks efficiently, and it is also difficult to establish a unified model that represents different shapes effectively, because of the variety and occlusion of scene objects. Taking the scene super-patch as the data representation and guided by its contextual information, we propose a novel multiscale super-patch transformer network (MSSPTNet) for point cloud segmentation, which consists of a multiscale super-patch local aggregation (MSSPLA) module and a super-patch transformer (SPT) module. Given large-scale point cloud data as input, a dynamic region-growing algorithm is first adopted to extract scene super-patches from the sampling points with consistent geometric features. Then, the MSSPLA module aggregates the local features and contextual information of adjacent super-patches at different scales. Owing to the self-attention mechanism, the SPT module exploits the similarity among scene super-patches in a high-level feature space. By combining these two modules, MSSPTNet can effectively learn both local and global features from the input point clouds. Finally, interpolation-based upsampling and multi-layer perceptrons are used to generate semantic labels for the original point cloud data. Experimental results on the public S3DIS dataset demonstrate the efficiency of the proposed network for segmenting large-scale point cloud scenes, especially indoor scenes with a large number of repetitive structures, i.e., the network training of MSSPTNet is faster than that of other segmentation networks by a factor of tens to hundreds.
multidisciplinary sciences
What problem does this paper attempt to address?
The problem that this paper attempts to solve is the efficient semantic segmentation of large-scale point cloud scenes. Specifically, the authors focus on how to effectively process and understand complex 3D environments, especially indoor scenes. Due to the large amount of point cloud data, its uneven or sparse distribution, and the diversity and occlusion of scene objects, training deep neural networks for effective feature learning and extraction has always been a challenge.

### Main Problems

1. **Large-scale Point Cloud Data Processing**: Large-scale point cloud data makes feature learning and extraction difficult.
2. **Effective Extraction of Local Geometric Features**: It is necessary to extract local geometric features from the point cloud so that the model can represent different shapes effectively.
3. **Exploration of Global Context Information**: In addition to local features, global context information also needs to be explored in order to better understand the entire scene.

### Solutions

To solve the above problems, the authors propose a Multiscale Super-Patch Transformer Network (MSSPTNet). The main innovations of this network include:

- **Scene Super-Patch Representation**: Scene super-patches with geometric consistency are extracted from sampling points through a dynamic region-growing algorithm, thereby reducing the time and memory consumption of network training.
- **Multiscale Super-Patch Local Aggregation Module (MSSPLA)**: This module is used to aggregate local features and their context information at different scales.
- **Super-Patch Transformer Module (SPT)**: The self-attention mechanism is used to calculate the similarity between scene super-patches, thereby exploring global context information in the high-level feature space.

### Specific Steps

1. **Scene Super-Patch Generation**: Use the dynamic region-growing algorithm to extract scene super-patches from the input large-scale point cloud.
2. **Feature Descriptor Calculation**: Determine the principal axis of each super-patch through principal component analysis (PCA) and calculate feature descriptors (such as centroid, normal vector, color, etc.).
3. **Multi-scale Feature Extraction and Aggregation**: Use the MSSPLA module to extract and aggregate local features at different scales.
4. **Global Information Extraction**: Use the SPT module to further extract global information through the self-attention mechanism.
5. **Final Segmentation Result Generation**: Generate semantic labels for the original point cloud data through interpolation upsampling and multi-layer perceptrons.

### Experimental Verification

The experimental results show that MSSPTNet outperforms other segmentation networks on the public dataset S3DIS. Especially when dealing with indoor scenes with a large number of repetitive structures, the network training speed is increased by tens to hundreds of times.

In conclusion, this paper aims to solve the efficiency and effectiveness problems in large-scale point cloud semantic segmentation by introducing scene super-patch representation and a Transformer-based framework.
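
The feature descriptor step (PCA on each super-patch) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `super_patch_descriptor`, the choice of the mean color as the color descriptor, and the convention of taking the normal as the eigenvector of the smallest covariance eigenvalue are all assumptions made for illustration.

```python
import numpy as np


def super_patch_descriptor(points, colors):
    """Hypothetical descriptor for one super-patch.

    points: (N, 3) array of xyz coordinates.
    colors: (N, 3) array of per-point colors.
    """
    points = np.asarray(points, dtype=float)
    colors = np.asarray(colors, dtype=float)

    # Centroid of the patch and centered coordinates.
    centroid = points.mean(axis=0)
    centered = points - centroid

    # 3x3 covariance matrix of the patch geometry.
    cov = centered.T @ centered / len(points)

    # eigh returns eigenvalues in ascending order for symmetric matrices.
    eigvals, eigvecs = np.linalg.eigh(cov)

    # Assumption: smallest-variance direction approximates the normal,
    # largest-variance direction is the principal axis.
    normal = eigvecs[:, 0]
    principal_axis = eigvecs[:, 2]

    return {
        "centroid": centroid,
        "normal": normal,
        "principal_axis": principal_axis,
        "mean_color": colors.mean(axis=0),
    }
```

The sign of an eigenvector is arbitrary, so a real pipeline would likely need an extra orientation step (for example, flipping normals toward the sensor) before using them as features.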
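
The idea behind the SPT module, using self-attention to measure similarity between super-patches, can be sketched as standard single-head scaled dot-product attention over per-patch feature vectors. The summary does not give the module's actual architecture, so the function name `super_patch_self_attention`, the single head, and the projection matrices `w_q`, `w_k`, `w_v` are illustrative assumptions rather than the paper's design.

```python
import numpy as np


def super_patch_self_attention(features, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention (illustrative only).

    features: (P, C) array, one feature vector per super-patch.
    w_q, w_k, w_v: (C, D) projection matrices (hypothetical parameters).
    Returns the attended features (P, D) and the attention matrix (P, P).
    """
    q = features @ w_q
    k = features @ w_k
    v = features @ w_v

    # Pairwise similarity between super-patches, scaled by sqrt(D).
    scores = q @ k.T / np.sqrt(k.shape[-1])

    # Numerically stable row-wise softmax.
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)

    # Each super-patch becomes a weighted mix of all super-patch values,
    # which is how global context enters the per-patch features.
    return weights @ v, weights
```

Because attention here runs over P super-patches rather than over all raw points, the P x P attention matrix stays small, which is consistent with the efficiency argument made for the super-patch representation.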