SemanticSLAM: Learning based Semantic Map Construction and Robust Camera Localization

Mingyang Li, Yue Ma, Qinru Qiu
2024-01-24
Abstract: Current techniques in Visual Simultaneous Localization and Mapping (VSLAM) estimate camera displacement by comparing image features of consecutive scenes. These algorithms depend on scene continuity and hence require frequent camera inputs. However, processing images frequently can lead to significant memory usage and computation overhead. In this study, we introduce SemanticSLAM, an end-to-end visual-inertial odometry system that utilizes semantic features extracted from an RGB-D sensor. This approach enables the creation of a semantic map of the environment and ensures reliable camera localization. SemanticSLAM is scene-agnostic, which means it does not require retraining for different environments. It operates effectively in indoor settings, even with infrequent camera input, without prior knowledge. The strength of SemanticSLAM lies in its ability to gradually refine the semantic map and improve pose estimation. This is achieved by a convolutional long short-term memory (ConvLSTM) network, trained to correct errors during map construction. Compared to existing VSLAM algorithms, SemanticSLAM improves pose estimation by 17%. The resulting semantic map provides interpretable information about the environment and can be easily applied to various downstream tasks, such as path planning, obstacle avoidance, and robot navigation. The code will be publicly available at
Robotics, Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper aims to address several key issues in Visual Simultaneous Localization and Mapping (VSLAM) technology, particularly the limitations of traditional VSLAM algorithms when dealing with low-frequency image inputs. Specifically, the paper proposes a new system called SemanticSLAM, whose main objectives include:

1. **Reducing the requirement for continuous scenes**: Traditional VSLAM algorithms rely on the similarity between adjacent frames to estimate camera displacement, which requires frequent acquisition of image data. However, this approach leads to significant memory consumption and computational overhead. SemanticSLAM can work in discontinuous scenes by utilizing semantic features extracted from RGB-D sensors, thereby reducing the need for frequent image processing.
2. **Improving localization accuracy**: By integrating a Convolutional Long Short-Term Memory (ConvLSTM) network, SemanticSLAM can progressively refine the semantic map of the environment and improve the accuracy of camera localization. Experimental results show that the pose estimation accuracy of SemanticSLAM is improved by 17% compared to existing VSLAM algorithms.
3. **Enhancing adaptability and interpretability**: The semantic maps constructed by SemanticSLAM can be generalized across different environments (i.e., no need for retraining for each new environment) and are more easily understood by humans. This makes them directly applicable to downstream tasks such as path planning, obstacle avoidance, and robot navigation.
4. **Integrating Inertial Measurement Unit (IMU) information**: To improve localization accuracy at the beginning of tasks, the paper also proposes a method for cross-verifying visual and inertial information. By using low-cost IMU data for initial position estimation, the area for map updates can be narrowed, thereby accelerating the system's convergence process.
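
The idea in item 4, using a coarse IMU position estimate to narrow the map area that gets updated, can be illustrated with a small sketch. This is not the paper's implementation: the 2D grid map, the constant-acceleration dead-reckoning step, the uncertainty radius, and every function name below (`dead_reckon`, `to_cell`, `update_window`) are assumptions made purely for illustration.

```python
from typing import Tuple


def dead_reckon(start_xy: Tuple[float, float],
                velocity_xy: Tuple[float, float],
                accel_xy: Tuple[float, float],
                dt: float) -> Tuple[float, float]:
    """Hypothetical IMU step: p = p0 + v*dt + 0.5*a*dt^2 per axis."""
    x = start_xy[0] + velocity_xy[0] * dt + 0.5 * accel_xy[0] * dt * dt
    y = start_xy[1] + velocity_xy[1] * dt + 0.5 * accel_xy[1] * dt * dt
    return x, y


def to_cell(xy: Tuple[float, float], cell_size: float) -> Tuple[int, int]:
    """Convert metric (x, y) to a (row, col) grid index (assumed layout)."""
    x, y = xy
    return int(y // cell_size), int(x // cell_size)


def update_window(map_shape: Tuple[int, int],
                  imu_cell: Tuple[int, int],
                  radius: int) -> Tuple[int, int, int, int]:
    """Return (row_start, row_end, col_start, col_end), end-exclusive,
    for a square window around the IMU estimate, clipped to the map."""
    rows, cols = map_shape
    r, c = imu_cell
    return (max(0, r - radius), min(rows, r + radius + 1),
            max(0, c - radius), min(cols, c + radius + 1))


# Assumed example values: 10x10 grid, 0.5 m cells, 2-cell uncertainty radius.
position = dead_reckon((0.0, 0.0), (1.0, 0.0), (0.0, 2.0), dt=1.0)
cell = to_cell(position, cell_size=0.5)
window = update_window((10, 10), cell, radius=2)

# Fraction of the map that still has to be considered for the update.
window_cells = (window[1] - window[0]) * (window[3] - window[2])
fraction = window_cells / (10 * 10)
```

Under these assumed values only the cells inside `window` would be touched by a map update, which is the sense in which a rough inertial prior can shrink the search and speed up convergence. How the actual system sizes and weights this region is not stated in the summary above.
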
In summary, SemanticSLAM aims to improve on traditional VSLAM methods by combining semantic information with a ConvLSTM network, achieving more accurate and robust localization and mapping, especially in scenarios with low-frequency image inputs.