Voxel- and Bird's-Eye-View-Based Semantic Scene Completion for LiDAR Point Clouds

Li Liang, Naveed Akhtar, Jordan Vice, Ajmal Mian
DOI: https://doi.org/10.3390/rs16132266
IF: 5
2024-06-21
Remote Sensing
Abstract: Semantic scene completion is a crucial outdoor scene understanding task that has direct implications for technologies like autonomous driving and robotics. It compensates for unavoidable occlusions and partial measurements in LiDAR scans, which may otherwise cause catastrophic failures. Due to the inherent complexity of this task, existing methods generally rely on complex and computationally demanding scene completion models, which limits their practicality in downstream applications. Addressing this, we propose a novel integrated network that combines the strengths of 3D and 2D semantic scene completion techniques for efficient LiDAR point cloud scene completion. Our network leverages a newly devised lightweight multi-scale convolutional block (MSB) to efficiently aggregate multi-scale features, thereby improving the identification of small and distant objects. It further utilizes a layout-aware semantic block (LSB), developed to grasp the overall layout of the scene to precisely guide the reconstruction and recognition of features. Moreover, we also develop a feature fusion module (FFM) for effective interaction between the data derived from two disparate streams in our network, ensuring a robust and cohesive scene completion process. Extensive experiments with the popular SemanticKITTI dataset demonstrate that our method achieves highly competitive performance, with an mIoU of 35.7 and an IoU of 51.4. Notably, the proposed method achieves an mIoU improvement of 2.6% compared to previous methods.
Categories: environmental sciences; geosciences, multidisciplinary; imaging science & photographic technology; remote sensing
What problem does this paper attempt to address?
This paper addresses how to perform efficient 3D semantic scene completion from LiDAR point clouds for autonomous driving and robotics. Specifically, it combines 3D and 2D semantic scene completion techniques to overcome the shortcomings of existing methods: high computational complexity, large parameter counts, and inadequate handling of sparse, partially measured data.

### Main problems and challenges of the paper

1. **Data sparsity**: LiDAR point cloud data is inherently sparse, which makes it difficult for machines to fully understand the environment.
2. **Computational complexity**: Existing 3D semantic scene completion methods usually rely on complex models, which are computationally costly and limit their practicality in real-world applications.
3. **Initial segmentation error propagation**: Some existing methods (such as JS3C-Net) may propagate initial segmentation errors during scene completion, resulting in lost or inaccurate information.
4. **Multi-modal fusion**: Information from different perspectives (such as the voxel view and the bird's-eye view) must be fused effectively to improve the accuracy and robustness of scene completion.

### Overview of the solution

To solve the above problems, the authors propose an integrated network that combines a 3D Semantic Scene Completion Network (3D SSCNet) with a 2D Semantic Scene Completion Network (2D SSCNet), and introduce the following components:

1. **Multi-scale Convolution Block (MSB)**: Efficiently aggregates multi-scale features, improving the recognition of small objects, distant objects, and dense scenes.
2. **Layout-aware Semantic Block (LSB)**: Helps the network understand the overall layout of the scene and precisely guides feature reconstruction and recognition.
3. **Feature Fusion Module (FFM)**: Integrates data from the 3D and 2D networks so that the two streams complement each other and improve the overall scene completion result.

### Specific implementation of the method

#### 3D Semantic Scene Completion Network (3D SSCNet)

- **Multi-scale Convolution Block (MSB)**:
  - Approximates the effect of large convolution kernels by combining multiple 3×3×3 convolution layers, thereby reducing the amount of computation.
  - The formula is as follows:
    \[ F_{\text{out}}=\sum_{i = 1}^{N}W_iF_{\text{in}} \]
    where \(W_i\) is the convolution kernel weight of the \(i\)-th scale, and \(F_{\text{in}}\) and \(F_{\text{out}}\) are the input and output feature maps respectively.
- **Layout-aware Semantic Block (LSB)**:
  - Uses dimensional decomposition residual (DDR) blocks with progressively increasing dilation rates to capture the spatial layout context.
  - The formula is as follows:
    \[ F_{\text{out}1}=\sigma(W_{d1}F_{3D}+b_1) \]
    \[ F_{\text{out}2}=\sigma(W_{d2}F_{\text{out}1}+b_2) \]
    \[ F_{\text{out}3}=\sigma(W_{d3}F_{\text{out}2}+b_3) \]
    where \(d_1, d_2, d_3\) are the dilation rates, \(W_{d1}, W_{d2}, W_{d3}\) are the corresponding weight matrices, \(b_1, b_2, b_3\) are the bias terms, and \(\sigma\) is a non-linear activation function.

#### 2D Semantic Scene Completion Network (2D SSCNet)

- Utilizes bird's-eye view (BEV) features to provide accurate spatial layout information and enhance the performance of 3D semantic scene completion.
- Contains a lightweight 2D encoder-decoder architecture that is specifically optimized for semantic scene completion.

#### Feature Fusion Module (FFM)

- Integrates 3D and 2D features so that the strengths of the two streams complement each other and improve the overall accuracy of scene completion.

### Experimental results

Through extensive experiments on the SemanticKITTI dataset, this method obtains...
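
Returning to the MSB and LSB descriptions in the implementation section: the claim that stacked 3×3×3 convolutions approximate a larger kernel at lower cost, and that increasing dilation rates widen the spatial context, can be checked with simple arithmetic. The sketch below is not the authors' code. The helper names, the channel width of 32, and the dilation rates (1, 2, 3) are assumptions chosen for illustration; the summary does not state the actual values.

```python
def receptive_field(dilations, kernel=3):
    """Per-axis receptive field of stacked stride-1 convolutions.

    Each layer with kernel size k and dilation d widens the field by (k - 1) * d.
    """
    field = 1
    for d in dilations:
        field += (kernel - 1) * d
    return field


def conv3d_weights(kernel, channels):
    """Weight count of one k*k*k 3D convolution with equal in/out channels (no bias)."""
    return kernel ** 3 * channels * channels


CHANNELS = 32  # hypothetical channel width, not taken from the paper

# MSB idea: two stacked 3x3x3 layers see the same 5x5x5 window as one 5x5x5 layer.
rf_two_stacked = receptive_field([1, 1])
weights_stacked = 2 * conv3d_weights(3, CHANNELS)
weights_single = conv3d_weights(5, CHANNELS)

# LSB idea: increasing dilation rates (assumed 1, 2, 3) widen the context
# without adding weights compared with three undilated layers.
rf_plain = receptive_field([1, 1, 1])
rf_dilated = receptive_field([1, 2, 3])

print(rf_two_stacked, weights_stacked, weights_single)  # 5 55296 128000
print(rf_plain, rf_dilated)                             # 7 13
```

Under these assumptions, the stacked pair covers the same window as a single 5×5×5 kernel with well under half the weights, and the dilated stack nearly doubles the per-axis context of the undilated one at the same parameter cost.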