Abstract: Current techniques in Visual Simultaneous Localization and Mapping (VSLAM) estimate camera displacement by comparing image features of consecutive scenes. These algorithms depend on scene continuity and hence require frequent camera inputs. However, processing images frequently can lead to significant memory usage and computation overhead. In this study, we introduce SemanticSLAM, an end-to-end visual-inertial odometry system that utilizes semantic features extracted from an RGB-D sensor. This approach enables the creation of a semantic map of the environment and ensures reliable camera localization. SemanticSLAM is scene-agnostic, which means it does not require retraining for different environments. It operates effectively in indoor settings, even with infrequent camera input, without prior knowledge. The strength of SemanticSLAM lies in its ability to gradually refine the semantic map and improve pose estimation. This is achieved by a convolutional long short-term memory (ConvLSTM) network, trained to correct errors during map construction. Compared to existing VSLAM algorithms, SemanticSLAM improves pose estimation by 17%. The resulting semantic map provides interpretable information about the environment and can be easily applied to various downstream tasks, such as path planning, obstacle avoidance, and robot navigation. The code will be publicly available at
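
For readers unfamiliar with the ConvLSTM named in the abstract, the block below gives the commonly used formulation of a ConvLSTM cell, in which the matrix products of an ordinary LSTM are replaced by convolutions. This is the generic textbook form (without peephole connections), not necessarily the exact variant used in SemanticSLAM. The symbols are assumptions for illustration: $X_t$ is the input at step $t$, $H_t$ the hidden state, $C_t$ the cell state, $*$ a convolution, $\odot$ an element-wise product, and $\sigma$ the logistic sigmoid.

```latex
\begin{aligned}
i_t &= \sigma\left(W_{xi} * X_t + W_{hi} * H_{t-1} + b_i\right) \\
f_t &= \sigma\left(W_{xf} * X_t + W_{hf} * H_{t-1} + b_f\right) \\
o_t &= \sigma\left(W_{xo} * X_t + W_{ho} * H_{t-1} + b_o\right) \\
C_t &= f_t \odot C_{t-1} + i_t \odot \tanh\left(W_{xc} * X_t + W_{hc} * H_{t-1} + b_c\right) \\
H_t &= o_t \odot \tanh\left(C_t\right)
\end{aligned}
```

Because the states $H_t$ and $C_t$ keep their spatial layout, a cell of this kind can carry a grid-shaped map from one observation to the next, which is consistent with the gradual map refinement the abstract describes.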

What problem does this paper attempt to address?

The paper addresses several limitations of Visual Simultaneous Localization and Mapping (VSLAM), in particular the difficulty traditional VSLAM algorithms have with low-frequency image inputs. It proposes a new system, SemanticSLAM, with four main objectives:

1. **Reducing the requirement for continuous scenes**: Traditional VSLAM algorithms rely on the similarity between adjacent frames to estimate camera displacement, which requires frequent acquisition of image data and leads to significant memory consumption and computational overhead. SemanticSLAM can work in discontinuous scenes by using semantic features extracted from an RGB-D sensor, thereby reducing the need for frequent image processing.
2. **Improving localization accuracy**: By integrating a Convolutional Long Short-Term Memory (ConvLSTM) network, SemanticSLAM progressively refines the semantic map of the environment and improves the accuracy of camera localization. Experimental results show that its pose estimation accuracy is 17% better than that of existing VSLAM algorithms.
3. **Enhancing adaptability and interpretability**: SemanticSLAM generalizes across environments without retraining for each new one, and the semantic maps it builds are easier for humans to interpret. This makes them directly applicable to downstream tasks such as path planning, obstacle avoidance, and robot navigation.
4. **Integrating Inertial Measurement Unit (IMU) information**: To improve localization accuracy at the beginning of a task, the paper also proposes a method for cross-verifying visual and inertial information. Using low-cost IMU data for an initial position estimate narrows the area of the map that needs to be updated, which accelerates the system's convergence.

In summary, SemanticSLAM aims to improve on traditional VSLAM methods by combining semantic information with a ConvLSTM network, achieving more accurate and robust localization and mapping, especially in scenarios with low-frequency image inputs.
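
The IMU objective above (using an inertial position estimate to narrow the map area that is searched and updated) can be illustrated with a small sketch. This is not the paper's implementation: the grid representation, the function names `search_window`, `match_score`, and `localize`, and the label-counting score are all hypothetical choices made for illustration. The map is a 2D grid of integer semantic labels (0 means unknown), and the camera observation is a small patch of labels.

```python
from typing import List, Optional, Tuple

Grid = List[List[int]]            # one semantic label per cell, 0 = unknown
Window = Tuple[int, int, int, int]  # (row_min, row_max, col_min, col_max), inclusive


def search_window(prev_cell: Tuple[int, int], imu_shift: Tuple[int, int],
                  radius: int, map_shape: Tuple[int, int]) -> Window:
    """Window of candidate cells around the IMU-predicted position, clipped to the map."""
    rows, cols = map_shape
    r = prev_cell[0] + imu_shift[0]  # dead-reckoned row
    c = prev_cell[1] + imu_shift[1]  # dead-reckoned column
    return (max(0, r - radius), min(rows - 1, r + radius),
            max(0, c - radius), min(cols - 1, c + radius))


def match_score(global_map: Grid, patch: Grid, top: int, left: int) -> int:
    """Number of known patch cells whose label agrees with the map at this placement."""
    score = 0
    for i, row in enumerate(patch):
        for j, label in enumerate(row):
            if label != 0 and global_map[top + i][left + j] == label:
                score += 1
    return score


def localize(global_map: Grid, patch: Grid,
             window: Window) -> Tuple[Optional[Tuple[int, int]], int]:
    """Best top-left placement of the patch among placements inside the window."""
    rows, cols = len(global_map), len(global_map[0])
    ph, pw = len(patch), len(patch[0])
    r0, r1, c0, c1 = window
    best, best_score = None, -1
    # Only placements that keep the whole patch inside the map are considered.
    for top in range(r0, min(r1, rows - ph) + 1):
        for left in range(c0, min(c1, cols - pw) + 1):
            s = match_score(global_map, patch, top, left)
            if s > best_score:  # ties keep the first placement found
                best, best_score = (top, left), s
    return best, best_score


# Toy map with two identical-looking object pairs (labels 2 over 3).
demo_map = [[0] * 6 for _ in range(6)]
demo_map[0][0], demo_map[1][0] = 2, 3   # decoy far from the camera
demo_map[3][4], demo_map[4][4] = 2, 3   # objects actually observed
demo_patch = [[2], [3]]

narrow = search_window(prev_cell=(2, 3), imu_shift=(1, 1), radius=1, map_shape=(6, 6))
with_imu = localize(demo_map, demo_patch, narrow)          # ((3, 4), 2)
without_imu = localize(demo_map, demo_patch, (0, 5, 0, 5))  # ((0, 0), 2)
```

In this toy case the full-map search locks onto the decoy because both placements score equally, while the IMU-bounded window excludes the decoy and also evaluates 9 placements instead of 30. A real system would use a continuous pose and a learned matching score rather than label counting.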

SemanticSLAM: Learning based Semantic Map Construction and Robust Camera Localization

From Satellite to Ground: Satellite Assisted Visual Localization with Cross-view Semantic Matching

Semantic SLAM Based on Object Detection and Improved Octomap

A Mobile Robot Visual SLAM System With Enhanced Semantics Segmentation

Neural Implicit Dense Semantic SLAM

Semantic visual simultaneous localization and mapping (SLAM) using deep learning for dynamic scenes

Semi-Dense 3D Semantic Mapping from Monocular SLAM

A semantic visual SLAM based on improved mask R-CNN in dynamic environment

Dynamic Visual SLAM Based on Semantic Information and Multi-View Geometry.

SCE-SLAM: a real-time semantic RGBD SLAM system in dynamic scenes based on spatial coordinate error

Survey of simultaneous localization and mapping based on environmental semantic information

MSeg-SLAM: A Semantic Visual SLAM System for Dynamic Scenes.

Monocular Semantic SLAM using Object-pose-graph Constraints

Semantic SLAM for mobile Robots in dynamic environments Based on visual camera sensors

SG-SLAM: A Real-Time RGB-D Visual SLAM Toward Dynamic Scenes With Semantic and Geometric Information

MISD-SLAM: Multimodal Semantic SLAM for Dynamic Environments

Semantic Visual Simultaneous Localization and Mapping: A Survey

Real-Time Visual-Inertial Localization Using Semantic Segmentation Towards Dynamic Environments

RS-SLAM: Real time semantic slam with driverless car using LiDAR-Camera-IMU sensing

MVS‐SLAM: Enhanced multiview geometry for improved semantic RGBD SLAM in dynamic environment

Edge Assisted Mobile Semantic Visual SLAM