Abstract: Light detection and ranging (LiDAR) sensors and cameras are two commonly used sensors that acquire data of different modalities for environmental perception. For autonomous vehicles operating in unstructured scenes, fusing these two modalities is particularly important for semantic segmentation. Most existing methods rely on data from a single sensor, which limits segmentation performance. Some fusion-based methods, however, cannot balance the contributions of point cloud features and image features, resulting in mediocre performance. To address these issues, we propose a multisensor fusion network for unstructured scene segmentation with surface normal (SN) information incorporated, called MF-SN Net. First, we use complementary features to highlight the characteristics of unstructured scenes. Point-represented and range view (RV)-represented LiDAR information are combined as a baseline to fully utilize 3-D information while maintaining efficiency. RGB images from the camera are reweighted and fused with RV-represented LiDAR features at different scales. Second, we propose a novel method that uses SN information to reweight image features before fusing them into LiDAR features, which reduces the negative impact of inaccurate image features. Third, we design a cross-layer attention module that propagates semantic information from high-level features to different layers, which improves feature extraction from the original point clouds. In addition, we build a synthesized dataset using the CARLA simulator to enrich the experimental scenes, so that the network's performance can be evaluated under various conditions. Experimental results on different datasets demonstrate the effectiveness and robustness of our network and show that it is competitive with state-of-the-art methods.
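
The idea of reweighting image features with SN information before fusion can be illustrated with a minimal sketch. This is not the MF-SN Net implementation: the abstract does not specify how normals are computed or how the weights are derived. The sketch assumes an organized range-view grid of 3-D points (at least 2 x 2), estimates normals from neighbor differences, and uses a purely hypothetical gate, the absolute z-component of the normal, to scale image features before adding them to LiDAR features. The function names `estimate_normals` and `reweight_and_fuse` are illustrative.

```python
import math


def cross(a, b):
    """Cross product of two 3-vectors."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def normalize(v):
    """Scale a 3-vector to unit length; a zero vector stays zero."""
    n = math.sqrt(sum(c * c for c in v))
    return tuple(c / n for c in v) if n > 0 else (0.0, 0.0, 0.0)


def estimate_normals(points):
    """Estimate unit normals on an H x W range-view grid of (x, y, z) points.

    Uses forward differences to the right and lower neighbors, falling back
    to backward differences on the last row and column.
    """
    h, w = len(points), len(points[0])
    normals = []
    for i in range(h):
        row = []
        for j in range(w):
            p = points[i][j]
            if j + 1 < w:
                du = tuple(points[i][j + 1][k] - p[k] for k in range(3))
            else:
                du = tuple(p[k] - points[i][j - 1][k] for k in range(3))
            if i + 1 < h:
                dv = tuple(points[i + 1][j][k] - p[k] for k in range(3))
            else:
                dv = tuple(p[k] - points[i - 1][j][k] for k in range(3))
            row.append(normalize(cross(du, dv)))
        normals.append(row)
    return normals


def reweight_and_fuse(lidar_feat, image_feat, normals):
    """Fuse per-pixel feature vectors as lidar + w * image.

    Hypothetical gate (an assumption, not the paper's method): w = |n_z|,
    so image features count fully on horizontal surfaces and are
    suppressed on vertical ones.
    """
    fused = []
    for i, row in enumerate(normals):
        fused_row = []
        for j, n in enumerate(row):
            w = abs(n[2])
            fused_row.append([l + w * m
                              for l, m in zip(lidar_feat[i][j], image_feat[i][j])])
        fused.append(fused_row)
    return fused
```

In a learned network the gate would more plausibly be produced by trainable layers that take SN features as input; the fixed gate here only makes the data flow concrete.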

Transformer-Based Cross-Modal Information Fusion Network for Semantic Segmentation

NLFNet: Non-Local Fusion Towards Generalized Multimodal Semantic Segmentation Across RGB-Depth, Polarization, and Thermal Images

CLFusion: 3D Semantic Segmentation Based on Camera and Lidar Fusion

TCFNet: Transformer and CNN Fusion Model for LiDAR Point Cloud Semantic Segmentation

A Crossmodal Multiscale Fusion Network for Semantic Segmentation of Remote Sensing Data

Robust 3D Semantic Segmentation Method Based on Multi-Modal Collaborative Learning

Multispectral Fusion Transformer Network for RGB-Thermal Urban Scene Semantic Segmentation

MFTransNet: A Multi-Modal Fusion with CNN-Transformer Network for Semantic Segmentation of HSR Remote Sensing Images

TCNet: Multiscale Fusion of Transformer and CNN for Semantic Segmentation of Remote Sensing Images

A Semantic-Aware and Multi-Guided Network for Infrared-Visible Image Fusion

CMDFusion: Bidirectional Fusion Network with Cross-modality Knowledge Distillation for LIDAR Semantic Segmentation

A Transformer-based Multi-Modal Fusion Network for Semantic Segmentation of High-Resolution Remote Sensing Imagery

A Multi-phase Camera-LiDAR Fusion Network for 3D Semantic Segmentation with Weak Supervision

DefFusion: Deformable Multimodal Representation Fusion for 3D Semantic Segmentation

RGB and LiDAR Fusion-based 3D Semantic Segmentation for Autonomous Driving

Complementarity-aware cross-modal feature fusion network for RGB-T semantic segmentation

LIF-Seg: LiDAR and Camera Image Fusion for 3D LiDAR Semantic Segmentation

Cross-modal Attention Fusion Network for RGB-D Semantic Segmentation

Multi-sensor Fusion Network for Unstructured Scene Segmentation with Surface Normal Incorporated

SIESEF-FusionNet: Spatial Inter-correlation Enhancement and Spatially-Embedded Feature Fusion Network for LiDAR Point Cloud Semantic Segmentation