AMVNet: Assertion-based Multi-View Fusion Network for LiDAR Semantic Segmentation

Venice Erin Liong, Thi Ngoc Tho Nguyen, Sergi Widjaja, Dhananjai Sharma, Zhuang Jie Chong
DOI: https://doi.org/10.48550/arXiv.2012.04934
2020-12-09
Abstract: In this paper, we present an Assertion-based Multi-View Fusion network (AMVNet) for LiDAR semantic segmentation which aggregates the semantic features of individual projection-based networks using late fusion. Given class scores from different projection-based networks, we perform assertion-guided point sampling on score disagreements and pass a set of point-level features for each sampled point to a simple point head which refines the predictions. This modular-and-hierarchical late fusion approach provides the flexibility of having two independent networks with a minor overhead from a light-weight network. Such approaches are desirable for robotic systems, e.g. autonomous vehicles, for which the computational and memory resources are often limited. Extensive experiments show that AMVNet achieves state-of-the-art results in both the SemanticKITTI and nuScenes benchmark datasets and that our approach outperforms the baseline method of combining the class scores of the projection-based networks.
Computer Vision and Pattern Recognition, Machine Learning, Robotics
What problem does this paper attempt to address?
This paper aims to improve the accuracy of LiDAR point-cloud semantic segmentation. The authors propose an Assertion-based Multi-View Fusion Network (AMVNet) that combines the strengths of Range View (RV) and Bird's-Eye View (BEV) networks. The paper describes the following specific problems:

1. **Limitations of single-view methods**:
   - The Range View (RV) method performs well on nearby objects and surfaces (such as parking areas and roads). For distant or dense point clouds, however, multiple 3D points can be projected onto the same pixel, which makes the representation inaccurate.
   - The Bird's-Eye View (BEV) method performs well on distant objects, but has difficulty representing sparse point clouds and objects that extend in the vertical direction.
2. **Need for multi-view fusion**:
   - Because each view has different strengths and weaknesses in different scenarios, relying on a single view cannot give the best result. A method that effectively fuses information from multiple views is needed to exploit the strengths of each.
3. **Deficiencies of existing fusion methods**:
   - Most existing multi-view fusion methods rely on early feature fusion or sequential fusion, and fusing at a late stage can incur a large computational overhead. Deciding which uncertain points to select for further processing is an additional challenge.
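The range-view limitation noted above, where several 3D points land on the same pixel, can be illustrated with a minimal sketch of a spherical projection. This is not the paper's implementation: the function names `range_view_pixels` and `count_collisions`, the 64 x 2048 image size, and the field-of-view values are illustrative assumptions chosen only to make the collision visible.

```python
import numpy as np


def range_view_pixels(points, height=64, width=2048,
                      fov_up_deg=3.0, fov_down_deg=-25.0):
    """Map (N, 3) xyz points to (row, col) pixels of a spherical range image.

    Image size and field of view are assumed values, not taken from the paper.
    """
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    depth = np.linalg.norm(points, axis=1)

    yaw = np.arctan2(y, x)                           # horizontal angle
    pitch = np.arcsin(z / np.maximum(depth, 1e-8))   # vertical angle

    fov_down = np.deg2rad(fov_down_deg)
    fov = np.deg2rad(fov_up_deg) - fov_down

    # Normalise angles to [0, 1] and scale to image coordinates.
    col = 0.5 * (1.0 - yaw / np.pi) * width
    row = (1.0 - (pitch - fov_down) / fov) * height

    col = np.clip(np.floor(col), 0, width - 1).astype(np.int64)
    row = np.clip(np.floor(row), 0, height - 1).astype(np.int64)
    return row, col


def count_collisions(row, col, width):
    """Count points that share a pixel with at least one earlier point."""
    flat = row * width + col
    _, counts = np.unique(flat, return_counts=True)
    return int((counts - 1).sum())


# Two points on the same ray at different ranges, plus one off to the side.
pts = np.array([[10.0, 0.0, 0.0],
                [20.0, 0.0, 0.0],
                [0.0, 10.0, 0.0]])
row, col = range_view_pixels(pts)
print(count_collisions(row, col, width=2048))
```

The first two points differ only in range, so they share both angles and fall on one pixel; a range image can keep only one of them, which is the loss of information the summary refers to.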
### Solutions proposed in the paper

To address the above problems, the paper proposes the following:

- **Assertion-based Multi-View Fusion Network (AMVNet)**:
  - **Multi-view network**: The point cloud is projected onto structured representations (RV and BEV), and an encoder-decoder network performs semantic segmentation on each to produce initial point-level class predictions.
  - **Assertion-guided point sampling strategy**: Points on which the RV and BEV networks disagree are selected for further processing. Specifically, the cosine similarity between the two networks' class predictions is computed for each point, and a threshold on that score marks the uncertain points.
  - **Point head architecture**: For each uncertain point, its point-level features and the features of its neighboring points are extracted and passed to a lightweight point head network, which produces the final prediction.

### Experimental results

The paper reports experiments on two benchmark datasets, SemanticKITTI and nuScenes. AMVNet outperforms the baseline methods in multiple categories and reaches state-of-the-art mIoU. It performs particularly well on categories such as bicycles, motorcycles, and pedestrians, which indicates that it is effective in complex scenes.

### Summary

With the Assertion-based Multi-View Fusion Network (AMVNet), the paper addresses the inconsistent performance of single-view methods across scenarios and achieves more accurate LiDAR point-cloud semantic segmentation. The method improves segmentation accuracy while keeping computational overhead low, which suits applications with limited computing resources such as autonomous driving.
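The assertion-guided point sampling step described under the proposed solutions can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function name `sample_uncertain_points`, the threshold value of 0.9, and the convention that a score below the threshold marks a point as uncertain are all hypothetical choices made for the example.

```python
import numpy as np


def sample_uncertain_points(scores_rv, scores_bev, threshold=0.9, eps=1e-8):
    """Return indices of points where the two views disagree.

    scores_rv, scores_bev: (N, C) per-point class scores from each network.
    threshold: assumed cut-off; points with cosine similarity below it
               are treated as uncertain (hypothetical value).
    """
    dot = np.sum(scores_rv * scores_bev, axis=1)
    norms = (np.linalg.norm(scores_rv, axis=1)
             * np.linalg.norm(scores_bev, axis=1))
    cos_sim = dot / np.maximum(norms, eps)   # guard against zero vectors

    uncertain_idx = np.flatnonzero(cos_sim < threshold)
    return uncertain_idx, cos_sim


# Three points, three classes: the views agree on points 0 and 2
# and pick different classes for point 1.
rv = np.array([[0.9, 0.1, 0.0],
               [0.8, 0.1, 0.1],
               [0.1, 0.1, 0.8]])
bev = np.array([[0.9, 0.1, 0.0],
                [0.1, 0.8, 0.1],
                [0.1, 0.2, 0.7]])

idx, sim = sample_uncertain_points(rv, bev)
print(idx)  # only point 1 falls below the assumed threshold
```

Only the sampled indices would then be passed, together with their point-level and neighbor features, to the lightweight point head, so the extra cost scales with the number of disagreements and not with the size of the point cloud.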