Abstract:<p>In complex robot application scenarios, traditional geometric maps are insufficient because they do not support interaction with the environment. This paper presents a large-scale, accurate three-dimensional (3D) semantic map that integrates Lidar and camera information to perceive road scenes in real time. First, simultaneous localization and mapping (SLAM) is performed to localize the robot through multi-sensor fusion of the Lidar and an inertial measurement unit (IMU), and a map of the surrounding scene is constructed while the robot is moving. Next, semantic segmentation of images based on convolutional neural networks (CNNs) is employed to obtain the semantic information of the environment. After temporal and spatial synchronization, Lidar and camera data are fused to generate semantically labeled point-cloud frames, which are then assembled into a semantic map according to the pose. In addition, to improve classification accuracy, a higher-order 3D fully connected conditional random field (CRF) method is used to optimize the semantic map. Finally, extensive experimental results on the KITTI dataset demonstrate the effectiveness of the proposed method.</p>
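The Lidar and camera fusion step described in the abstract can be illustrated with a minimal sketch. It assumes a pinhole camera with intrinsic matrix `K`, a known Lidar-to-camera extrinsic transform, and a per-pixel label image produced by some segmentation network. The function names `label_points` and `to_map_frame`, and the use of `-1` for unlabeled points, are hypothetical choices for illustration and are not taken from the paper.

```python
import numpy as np

def label_points(points_lidar, seg_labels, T_cam_lidar, K):
    """Assign each Lidar point the label of the pixel it projects onto.

    points_lidar: (N, 3) points in the Lidar frame
    seg_labels:   (H, W) integer label image from semantic segmentation
    T_cam_lidar:  (4, 4) transform from the Lidar frame to the camera frame
    K:            (3, 3) pinhole camera intrinsics
    Returns an (N,) integer array; -1 marks points outside the image
    or behind the camera (an assumption of this sketch).
    """
    n = points_lidar.shape[0]
    homogeneous = np.hstack([points_lidar, np.ones((n, 1))])
    cam = (T_cam_lidar @ homogeneous.T).T[:, :3]

    labels = np.full(n, -1, dtype=int)
    in_front = cam[:, 2] > 0                      # keep points with positive depth
    uvw = (K @ cam[in_front].T).T
    u = np.round(uvw[:, 0] / uvw[:, 2]).astype(int)
    v = np.round(uvw[:, 1] / uvw[:, 2]).astype(int)

    h, w = seg_labels.shape
    inside = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    idx = np.flatnonzero(in_front)[inside]
    labels[idx] = seg_labels[v[inside], u[inside]]
    return labels

def to_map_frame(points_lidar, T_map_lidar):
    """Move a labeled frame into the map frame using the SLAM pose (4x4)."""
    n = points_lidar.shape[0]
    homogeneous = np.hstack([points_lidar, np.ones((n, 1))])
    return (T_map_lidar @ homogeneous.T).T[:, :3]
```

In this sketch, the labels returned by `label_points` stay attached to their points while `to_map_frame` places each frame in the global map using the pose estimated by SLAM. Time synchronization between the two sensors is assumed to have been done beforehand.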

What problem does this paper attempt to address?

### Problems the paper attempts to solve

This paper addresses the insufficiency of traditional geometric maps, which do not support interaction with the environment, when robots operate in complex scenarios. Specifically, it proposes a method for constructing and optimizing large-scale, high-precision 3D semantic maps based on the fusion of Lidar and camera data, in order to perceive road scenes in real time.

#### Problem background

1. **Limitations of traditional geometric maps**:
   - Traditional geometric maps provide only geometric information about the environment and cannot meet the robot's need to understand it.
   - 3D environmental information built from a single sensor (such as an RGB-D camera or a binocular camera) suffers from high computational complexity, poor real-time performance, and sensitivity to illumination and texture, so it is difficult to achieve satisfactory performance in large-scale, complex outdoor environments.
2. **Advantages of multi-sensor fusion**:
   - Lidar accurately captures 3D data of objects at long range and offers high stability and flexibility, which makes it suitable for real-time localization and map construction.
   - Camera images yield semantic information directly through deep-learning-based semantic segmentation (for example with convolutional neural networks (CNNs)), avoiding the complexity of stereo matching.

#### Solutions

The proposed method fuses Lidar and camera data to address the problems above while meeting the real-time and accuracy requirements of map building. The main contributions are:

1. **Multi-sensor fusion for constructing real-time 3D semantic maps**:
   - Combining Lidar and camera information overcomes the limited application range and high computational complexity of traditional RGB-D and binocular vision sensors.
2. **Fast semantic segmentation architecture based on optimized PSPNet-50**:
   - A new network structure based on a simplified PSPNet-50 is proposed, trading off speed against accuracy to meet the needs of semantic map construction.
3. **Optimization with a higher-order 3D fully connected conditional random field (CRF) model**:
   - A higher-order 3D CRF model is designed to optimize the initial semantic map, further improving the accuracy of the resulting 3D semantic map.

With these methods, the paper constructs 3D semantic maps efficiently and accurately and supports advanced scene-interaction tasks (such as target grasping and object searching), thereby improving the efficiency of navigation, localization, and autonomous driving.
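To give a feel for how a fully connected CRF can clean up per-point labels, here is a minimal sketch of mean-field inference over a small point cloud. It is a simplified stand-in, not the paper's model. It uses only pairwise Potts terms with a single Gaussian kernel on 3D position, whereas the paper describes a higher-order model. The function name `dense_crf_mean_field` and the parameters `sigma`, `weight`, and `iters` are hypothetical. The brute-force N×N kernel is only practical for small N.

```python
import numpy as np

def _softmax(z):
    # Row-wise softmax, shifted by the row maximum for numerical stability.
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def dense_crf_mean_field(points, unary, sigma=1.0, weight=1.0, iters=5):
    """Refine per-point labels with a pairwise fully connected CRF.

    points: (N, 3) point coordinates
    unary:  (N, C) unary energies, e.g. negative log class probabilities
    Returns an (N,) array of refined labels.
    """
    # Gaussian affinity between every pair of points (no self-connection).
    d2 = ((points[:, None, :] - points[None, :, :]) ** 2).sum(axis=-1)
    kernel = np.exp(-d2 / (2.0 * sigma ** 2))
    np.fill_diagonal(kernel, 0.0)

    q = _softmax(-unary)                     # initial marginals from unaries
    for _ in range(iters):
        msg = kernel @ q                     # (N, C) neighbor support per label
        # Potts model: label l is penalized by the mass on all other labels.
        pairwise = weight * (msg.sum(axis=1, keepdims=True) - msg)
        q = _softmax(-unary - pairwise)
    return q.argmax(axis=1)
```

In this toy setting, a point whose segmentation output weakly disagrees with a tight cluster of confidently labeled neighbors is pulled toward the cluster's label, while distant clusters do not influence each other.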

Building and optimization of 3D semantic map based on Lidar and camera fusion
