Efficiently Expanding Receptive Fields: Local Split Attention and Parallel Aggregation for Enhanced Large-scale Point Cloud Semantic Segmentation

Haodong Wang, Chongyu Wang, Yinghui Quan, Di Wang
2024-09-03
Abstract: Expanding the receptive field in a deep learning model for large-scale 3D point cloud segmentation is an effective technique for capturing rich contextual information, which consequently enhances the network's ability to learn meaningful features. However, this often leads to increased computational complexity and risk of overfitting, challenging the efficiency and effectiveness of the learning paradigm. To address these limitations, we propose the Local Split Attention Pooling (LSAP) mechanism to effectively expand the receptive field through a series of local split operations, thus facilitating the acquisition of broader contextual knowledge. Concurrently, it optimizes the computational workload associated with attention-pooling layers to ensure a more streamlined processing workflow. Based on LSAP, a Parallel Aggregation Enhancement (PAE) module is introduced to enable parallel processing of data using both 2D and 3D neighboring information to further enhance contextual representations within the network. In light of the aforementioned designs, we put forth a novel framework, designated as LSNet, for large-scale point cloud semantic segmentation. Extensive evaluations demonstrated the efficacy of seamlessly integrating the proposed PAE module into existing frameworks, yielding significant improvements in mean intersection over union (mIoU) metrics, with a notable increase of up to 11%. Furthermore, LSNet demonstrated superior performance compared to state-of-the-art semantic segmentation networks on three benchmark datasets, including S3DIS, Toronto3D, and SensatUrban. It is noteworthy that our method achieved a substantial speedup of approximately 38.8% compared to those employing similar-sized receptive fields, which serves to highlight both its computational efficiency and practical utility in real-world large-scale scenes.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
### Problems the paper attempts to solve

This paper aims to find an effective way to expand the receptive field in large-scale 3D point cloud semantic segmentation. Specifically, the authors propose two new components, Local Split Attention Pooling (LSAP) and Parallel Aggregation Enhancement (PAE), which expand the receptive field of the network while maintaining computational efficiency, so that the model captures richer context information and learns better features.

#### Main problems and challenges

1. **Necessity of expanding the receptive field**:
   - In large-scale 3D point cloud semantic segmentation, expanding the receptive field is crucial for capturing rich context information. This helps the model better understand the complex structures and object relationships in the scene.
2. **Computational complexity and overfitting risk**:
   - Expanding the receptive field usually increases computational complexity, which in turn affects the training efficiency and generalization ability of the model. In addition, an overly large receptive field may cause the model to overfit, especially when the amount of data is limited.
3. **Limitations of existing methods**:
   - Although projection-based and voxel-based methods can handle large-scale point clouds, they lose geometric structure and incur high computational costs.
   - Point-based methods that process raw points directly (such as PointNet and its variants) handle irregular point clouds effectively, but still face challenges in computational efficiency and feature representation in large-scale scenarios.

### Solutions

1. **Local Split Attention Pooling (LSAP) mechanism**:
   - Through local split operations, the current feature is divided into two parts, and attention pooling is performed on each part separately, expanding the receptive field while reducing the computational cost.
   - Specific steps include:
     - Use the KNN algorithm to find the neighbor point indices of each point.
     - Perform two split operations. The first selects the nearest \( s_1 \) neighbor points; the second selects one point out of every \( s_2 \) neighbor points.
     - Embed the spatial coordinate information of the neighbor points using the Relative Position-based Point-wise Pyramid Encoding (RPPE) method.
     - Use a multi-layer perceptron (MLP) and an attention mechanism to pool the local features.
2. **Parallel Aggregation Enhancement (PAE)**:
   - By combining the 2D-KNN and 3D-KNN algorithms, PAE simultaneously captures the local features of point clouds in the plane and vertical directions, obtaining context information more comprehensively.
   - Specific steps include:
     - Pass the input features through an MLP to increase the number of channels.
     - Divide them into two parallel branches that search for neighbor points with 2D-KNN and 3D-KNN respectively.
     - Send the local features of the two branches to the LSAP module for local split attention pooling.
     - Finally, concatenate the two feature vectors and pass them through an MLP to form a residual block, enhancing the stability and reliability of the model.
3. **Feature Max Aggregation (FMA) module**:
   - In the decoding layer, up-sampling and max-pooling operations aggregate point features at different resolutions so that the most effective features are retained.
   - Specific steps include:
     - Upsample the decoding features of the previous layer so that their resolution matches that of the encoding features of the current layer.
     - Aggregate the up-sampled features with the neighbor point features, and then perform a max-pooling operation.
     - Concatenate the max-pooled features with the encoding features of the current layer and pass them through an MLP to form the final point feature vector.
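
As a rough illustration of the LSAP split operations and the two PAE branches described above, the following NumPy sketch may help. It is not the authors' implementation. The brute-force KNN, the choice of x and y as the plane for 2D-KNN, starting the strided split at the nearest neighbor, applying the pooling to the full feature vector, the random matrix `w` standing in for a learned MLP, and all function names are assumptions made for illustration. RPPE and the surrounding MLPs are omitted.

```python
import numpy as np

def knn_indices(query, support, k):
    """Brute-force KNN: indices of the k nearest support points, nearest first."""
    d2 = ((query[:, None, :] - support[None, :, :]) ** 2).sum(axis=-1)  # (N, M)
    return np.argsort(d2, axis=1)[:, :k]

def local_split(neigh_idx, s1, s2):
    """Two split operations on a distance-sorted neighbor list.

    Assumption: the strided split starts at the nearest neighbor (offset 0).
    """
    near = neigh_idx[:, :s1]      # first split: the s1 nearest neighbors
    strided = neigh_idx[:, ::s2]  # second split: one neighbor out of every s2
    return near, strided

def attention_pool(feats, w):
    """Softmax attention over the neighbor axis, then a weighted sum.

    feats: (N, k, C) neighbor features; w: (C, C) stand-in for a learned layer.
    """
    scores = feats @ w
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    return (weights * feats).sum(axis=1)  # (N, C)

rng = np.random.default_rng(0)
xyz = rng.random((64, 3))    # toy point coordinates
feats = rng.random((64, 8))  # toy per-point features
w = rng.standard_normal((8, 8))
k, s1, s2 = 16, 4, 4

# PAE-style parallel branches: 2D-KNN (assumed: x, y only) and 3D-KNN.
idx_2d = knn_indices(xyz[:, :2], xyz[:, :2], k)
idx_3d = knn_indices(xyz, xyz, k)

pooled = []
for idx in (idx_2d, idx_3d):
    near, strided = local_split(idx, s1, s2)
    pooled.append(attention_pool(feats[near], w))
    pooled.append(attention_pool(feats[strided], w))

out = np.concatenate(pooled, axis=1)  # two branches x two splits x 8 channels
print(out.shape)  # (64, 32)
```

In this toy setting each attention pooling sees only 4 neighbors, yet the strided split reaches as far as the 13th nearest neighbor. This is one way to read the claim that splitting widens the receptive field without pooling over all 16 neighbors at once.
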
### Experimental results

The authors conducted experiments on three large-scale benchmark datasets (S3DIS, Toronto3D, SensatUrban) to verify the effectiveness of the proposed LSNet framework. The experimental results show that:

- **Performance improvement**: Integrating the PAE module into existing frameworks improves the mean Intersection over Union (mIoU) by up to 11%, and LSNet outperforms state-of-the-art semantic segmentation networks on all three datasets.
- **Computational efficiency**: Compared with methods that use a similar-sized receptive field, LSNet achieves a speedup of about 38.8%, demonstrating its practicality for real-world large-scale scenes.
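
For readers unfamiliar with the metric, mIoU is conventionally the average over classes of TP / (TP + FP + FN), computed from a confusion matrix. The sketch below shows that standard definition on toy labels. The helper name `mean_iou` and the data are hypothetical, and benchmark-specific conventions such as ignored classes are not covered.

```python
import numpy as np

def mean_iou(y_true, y_pred, num_classes):
    """Mean IoU over classes that appear in the labels or the predictions."""
    # Confusion matrix: rows are true classes, columns are predicted classes.
    cm = np.bincount(num_classes * y_true + y_pred,
                     minlength=num_classes ** 2).reshape(num_classes, num_classes)
    tp = np.diag(cm)
    union = cm.sum(axis=0) + cm.sum(axis=1) - tp  # TP + FP + FN per class
    valid = union > 0                             # skip classes that never occur
    return (tp[valid] / union[valid]).mean()

y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0])
miou = mean_iou(y_true, y_pred, num_classes=3)
print(round(miou, 3))  # 0.5  (per-class IoU: 1/3, 2/3, 1/2)
```
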