Abstract: In this paper, we present an Assertion-based Multi-View Fusion network (AMVNet) for LiDAR semantic segmentation which aggregates the semantic features of individual projection-based networks using late fusion. Given class scores from different projection-based networks, we perform assertion-guided point sampling on score disagreements and pass a set of point-level features for each sampled point to a simple point head which refines the predictions. This modular-and-hierarchical late fusion approach provides the flexibility of having two independent networks with a minor overhead from a light-weight network. Such approaches are desirable for robotic systems, e.g. autonomous vehicles, for which the computational and memory resources are often limited. Extensive experiments show that AMVNet achieves state-of-the-art results on both the SemanticKITTI and nuScenes benchmark datasets and that our approach outperforms the baseline method of combining the class scores of the projection-based networks.

What problem does this paper attempt to address?

The paper aims to improve the accuracy of LiDAR point-cloud semantic segmentation. Specifically, the authors propose an Assertion-based Multi-View Fusion Network (AMVNet) that combines the strengths of Range View (RV) and Bird's-Eye View (BEV) networks. The specific problems described in the paper are:

1. **Limitations of single-view methods**:
   - The Range View (RV) method performs well on nearby regions (such as parking spaces and roads), but with long-distance or dense point clouds, multiple 3D points may be projected onto the same pixel, which makes the representation inaccurate.
   - The Bird's-Eye View (BEV) method performs well on long-distance objects, but has difficulty representing sparse point clouds and objects that extend in the vertical direction.
2. **Need for multi-view fusion**:
   - Because each view has its own advantages and disadvantages in different scenarios, relying on a single view cannot give the best result. A method is needed that fuses information from multiple views effectively so that the advantages of each can be used.
3. **Deficiencies of existing fusion methods**:
   - Most existing multi-view fusion methods focus on early feature fusion or sequential fusion, and applying them at a late stage incurs a large computational overhead. How to select the uncertain points that need further processing is also a challenge.

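
The RV limitation noted above (several 3D points landing on one pixel) can be illustrated with a minimal sketch of a spherical range-view projection. This is not the paper's code: the function names `range_view_pixels` and `count_collisions`, the 64 x 2048 image size, and the vertical field of view are assumptions chosen only for illustration.

```python
import numpy as np

def range_view_pixels(points, width=2048, height=64,
                      fov_up_deg=3.0, fov_down_deg=-25.0):
    """Map (N, 3) points to integer (row, col) pixels of a range image.

    Image size and field of view are assumed values, not taken from the paper.
    """
    points = np.asarray(points, dtype=np.float64)
    depth = np.linalg.norm(points, axis=1)
    yaw = np.arctan2(points[:, 1], points[:, 0])              # horizontal angle
    pitch = np.arcsin(points[:, 2] / np.maximum(depth, 1e-12))  # vertical angle

    fov_up = np.radians(fov_up_deg)
    fov_down = np.radians(fov_down_deg)
    fov = fov_up - fov_down

    u = 0.5 * (1.0 - yaw / np.pi) * width           # column coordinate
    v = (1.0 - (pitch - fov_down) / fov) * height   # row coordinate

    cols = np.clip(np.floor(u), 0, width - 1).astype(np.int64)
    rows = np.clip(np.floor(v), 0, height - 1).astype(np.int64)
    return rows, cols

def count_collisions(rows, cols, width):
    """Count points that share a pixel with at least one earlier point."""
    flat = rows * width + cols
    _, counts = np.unique(flat, return_counts=True)
    return int(np.sum(counts - 1))

# Two points on the same ray at different ranges, plus one elsewhere.
pts = np.array([[10.0, 0.0, 0.0],
                [20.0, 0.0, 0.0],
                [0.0, 10.0, 0.0]])
rows, cols = range_view_pixels(pts)
print(count_collisions(rows, cols, width=2048))
```

In this toy input the first two points differ only in range, so they fall on the same pixel and only one of them can be represented there.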
### Solutions proposed in the paper

To address the above problems, the paper proposes the Assertion-based Multi-View Fusion Network (AMVNet), which consists of:

- **Multi-view networks**: The point cloud is projected onto structured representations (RV and BEV), and an encoder-decoder network performs semantic segmentation on each, yielding initial point-level class predictions.
- **Assertion-guided point sampling strategy**: Uncertain points are selected for further processing according to where the RV and BEV class predictions differ. Specifically, the cosine similarity between the two networks' predictions is computed for each point, and a threshold is applied to mark the uncertain points.
- **Point head architecture**: For each uncertain point, its point-level features and the features of its neighboring points are extracted and passed to a lightweight point head network, which produces the final prediction.

### Experimental results

The paper reports experiments on two benchmark datasets, SemanticKITTI and nuScenes. The results show that AMVNet outperforms the baseline methods in multiple categories and reaches state-of-the-art mIoU. In particular, AMVNet performs well on categories such as bicycles, motorcycles, and pedestrians, which supports its effectiveness in complex scenes.

### Summary

By introducing the Assertion-based Multi-View Fusion Network (AMVNet), the paper addresses the inconsistent performance of single-view methods across scenarios and achieves more accurate LiDAR point-cloud semantic segmentation. The method improves segmentation accuracy while keeping computational overhead low, which suits applications with limited computing resources such as autonomous driving.
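
The assertion-guided sampling step described under the proposed solutions can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the threshold value of 0.9, and the use of score averaging for the baseline fusion are all hypothetical. The point head that refines the sampled points is not shown.

```python
import numpy as np

def sample_uncertain_points(rv_scores, bev_scores, threshold=0.9):
    """Return indices of points whose RV and BEV class scores disagree.

    rv_scores, bev_scores: (N, C) arrays of per-point class scores.
    threshold: assumed cutoff on cosine similarity (not from the paper).
    """
    rv = np.asarray(rv_scores, dtype=np.float64)
    bev = np.asarray(bev_scores, dtype=np.float64)
    dot = np.sum(rv * bev, axis=1)
    norms = np.linalg.norm(rv, axis=1) * np.linalg.norm(bev, axis=1)
    cos_sim = dot / np.maximum(norms, 1e-12)  # guard against zero vectors
    uncertain = np.flatnonzero(cos_sim < threshold)
    return uncertain, cos_sim

def fuse_scores_baseline(rv_scores, bev_scores):
    """Assumed baseline: average the two score arrays and take the argmax."""
    mean = (np.asarray(rv_scores) + np.asarray(bev_scores)) / 2.0
    return np.argmax(mean, axis=1)

# Toy scores for 3 points and 3 classes; the networks disagree on point 1.
rv = np.array([[0.90, 0.05, 0.05],
               [0.10, 0.80, 0.10],
               [0.60, 0.30, 0.10]])
bev = np.array([[0.85, 0.10, 0.05],
                [0.70, 0.20, 0.10],
                [0.50, 0.40, 0.10]])

uncertain_idx, similarity = sample_uncertain_points(rv, bev, threshold=0.9)
print(uncertain_idx)                    # indices to pass on for refinement
print(fuse_scores_baseline(rv, bev))    # labels from plain score averaging
```

Only the sampled indices would be handed to the lightweight refinement network, which is what keeps the added cost small relative to running the two projection-based networks.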

AMVNet: Assertion-based Multi-View Fusion Network for LiDAR Semantic Segmentation

Attention-based Multi-modal Fusion Network for Semantic Scene Completion

MVG-Net: LiDAR Point Cloud Semantic Segmentation Network Integrating Multi-View Images

Multi-View Adaptive Fusion Network for 3D Object Detection

Multi-View Feature Fusion and Rich Information Refinement Network for Semantic Segmentation of Remote Sensing Images

MFVNet: a deep adaptive fusion network with multiple field-of-views for remote sensing image semantic segmentation

(AF)2-S3Net: Attentive Feature Fusion with Adaptive Feature Selection for Sparse Semantic Segmentation Network

RGB and LiDAR Fusion-based 3D Semantic Segmentation for Autonomous Driving

Similarity-Aware Fusion Network for 3D Semantic Segmentation

LACV-Net: Semantic Segmentation of Large-Scale Point Cloud Scene via Local Adaptive and Comprehensive VLAD

Multi-View PointNet for 3D Scene Understanding

APPFNet: Adaptive point-pixel fusion network for 3D semantic segmentation with neighbor feature aggregation

A Multi-phase Camera-LiDAR Fusion Network for 3D Semantic Segmentation with Weak Supervision

LMFNet: An Efficient Multimodal Fusion Approach for Semantic Segmentation in High-Resolution Remote Sensing

AMFF-Net: An Effective 3D Object Detector Based on Attention and Multi-Scale Feature Fusion

AMBrnet: Asymmetric Multi-Branch Residual Network for LiDAR Semantic Segmentation

AM3Net: Adaptive Mutual-learning-based Multimodal Data Fusion Network

BEVFusion: Multi-Task Multi-Sensor Fusion with Unified Bird's-Eye View Representation

MAFNet: dual-branch fusion network with multiscale atrous pyramid pooling aggregate contextual features for real-time semantic segmentation

MVF-Net: A Multi-view Fusion Network for Event-based Object Classification

Semantic Segmentation of LiDAR Point Cloud Based on CAFF-PointNet