BEV-SLAM: Building a Globally-Consistent World Map Using Monocular Vision

Mark Johnson, Oscar Alejandro Mendez Maldonado, Avishkar Saha, R. Bowden, James A. Ross
DOI: https://doi.org/10.1109/IROS47612.2022.9981258
2022-10-23
Abstract: The ability to produce large-scale maps for navigation, path planning and other tasks is a crucial step for autonomous agents, but has always been challenging. In this work, we introduce BEV-SLAM, a novel type of graph-based SLAM that aligns semantically-segmented Bird's Eye View (BEV) predictions from monocular cameras. We introduce a novel form of occlusion reasoning into BEV estimation and demonstrate its importance to aid spatial aggregation of BEV predictions. The result is a versatile SLAM system that can operate across arbitrary multi-camera configurations and can be seamlessly integrated with other sensors. We show that the use of multiple cameras significantly increases performance, and achieves lower relative error than high-performance GPS. The resulting system is able to create large, dense, globally-consistent world maps from monocular cameras mounted around an ego vehicle. The maps are metric and correctly-scaled, making them suitable for downstream navigation tasks.
Computer Science,Engineering
What problem does this paper attempt to address?
The problem that this paper attempts to solve is how to use monocular cameras to create a large-scale, globally consistent world map for tasks such as navigation and path planning. Specifically, the researchers introduce a novel graph-optimization SLAM system named BEV-SLAM, which achieves this goal by aligning semantically segmented Bird's Eye View (BEV) predictions.

### Main problems

1. **Creating large-scale maps**: Mobile autonomous agents need an information-rich environmental representation for navigation, planning, and localization, but creating such a representation has always been a challenge.
2. **Limitations of monocular vision**: Monocular SLAM suffers from a lack of three-dimensional cues and from scale ambiguity.
3. **Integration of multi-camera configurations**: How to use multiple monocular cameras effectively to improve performance while keeping the results correct and consistent.

### Solutions

BEV-SLAM addresses the above problems in the following ways:

- **Semantic segmentation and BEV prediction**: Use a convolutional neural network (CNN) to map directly from an image to a semantically labeled BEV map, thereby obtaining a properly scaled map that can be easily integrated with other sensors or maps.
- **Occlusion reasoning**: Introduce an occlusion reasoning mechanism to ensure temporal consistency, especially in occluded areas, which helps spatially aggregate BEV predictions.
- **Multi-camera support**: The system can handle any multi-camera configuration, and using multiple cameras significantly improves performance and reduces relative error.
- **Global consistency**: The generated large-scale dense map is metrically correct and suitable for downstream navigation tasks.
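As an illustration of how occlusion masks can help when aggregating BEV predictions, the following minimal Python sketch averages already-aligned BEV grids cell by cell and skips observations marked as occluded. The averaging rule, the function name `aggregate`, and the list-of-lists grid layout are assumptions made for illustration; this summary does not specify how the paper fuses predictions.

```python
def aggregate(frames, masks):
    """Average aligned BEV grids cell by cell, ignoring occluded observations.

    frames: list of 2D grids (same shape) holding per-cell scores for one class.
    masks:  matching list of 2D grids, 1 where the cell was visible, 0 where occluded.
    Returns (fused, seen): the fused scores and the number of frames that
    actually observed each cell.

    Hypothetical helper: plain averaging is an assumption, not the paper's rule.
    """
    h, w = len(frames[0]), len(frames[0][0])
    fused = [[0.0] * w for _ in range(h)]
    seen = [[0] * w for _ in range(h)]

    # Accumulate scores only from frames in which the cell was visible.
    for frame, mask in zip(frames, masks):
        for r in range(h):
            for c in range(w):
                if mask[r][c]:
                    fused[r][c] += frame[r][c]
                    seen[r][c] += 1

    # Normalise by the number of visible observations; unseen cells stay 0.0.
    for r in range(h):
        for c in range(w):
            if seen[r][c]:
                fused[r][c] /= seen[r][c]
    return fused, seen
```

Without the masks, a cell hidden behind another vehicle in one frame would pull the fused score toward whatever the network guessed for the hidden region, which is one plausible reason occlusion reasoning matters for aggregation.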
### Formula explanations

The formulas involved in the paper include:

- **Probability distribution**:
  \[ P(x_k, m \mid Z_{0:k}, U_{0:k}, x_0) \]
  where \(x_k\) is the ego-vehicle pose, \(m\) is the BEV map, \(Z\) is the landmark observation, and \(U\) is the map alignment.
- **Dice loss function**:
  \[ L_{\text{dice}} = 1 - \frac{1}{|C|}\sum_{c=1}^{|C|}\frac{2\sum_i \hat{t}_{ic}\, t_{ic}}{\sum_i \hat{t}_{ic} + \sum_i t_{ic} + \epsilon} \]
  where \(\hat{t}_{ic}\) is the ground truth, \(t_{ic}\) is the network prediction, and \(\epsilon\) is a small constant that prevents division by zero.
- **Optimal alignment**:
  \[ (\Delta x^*, \Delta y^*, \Delta\theta^*) = \arg\min_{\Delta x, \Delta y, \Delta\theta} \left\| (i_r - i_w) \odot M_{\text{occr}} \odot M_{\text{occw}} \right\|_2^2 \]
  where \(i_r\) and \(M_{\text{occr}}\) are the reference BEV map and its binary occlusion mask, and \(i_w\) and \(M_{\text{occw}}\) are the transformed BEV map and its binary occlusion mask.

Through these methods, BEV-SLAM can create high-quality, globally consistent maps in complex environments, suitable for a range of navigation and planning tasks.
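The Dice loss and the masked alignment objective can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names (`dice_loss`, `warp`, `masked_cost`, `align`), the nearest-neighbour warp, and the exhaustive search over candidate shifts and angles are all hypothetical choices. The alignment cost is read here as the squared difference between the reference map and the warped map, counted only in cells that are visible in both.

```python
import math


def dice_loss(t_hat, t, eps=1e-6):
    """Dice loss averaged over classes.

    t_hat, t: lists of per-pixel rows with one value per class,
    i.e. t_hat[i][c] and t[i][c] in the notation of the formula.
    """
    num_classes = len(t[0])
    total = 0.0
    for c in range(num_classes):
        overlap = sum(a[c] * b[c] for a, b in zip(t_hat, t))
        size = sum(a[c] for a in t_hat) + sum(b[c] for b in t)
        total += 2.0 * overlap / (size + eps)
    return 1.0 - total / num_classes


def warp(grid, dx, dy, theta, fill=0.0):
    """Rotate a 2D grid by theta about its centre, then shift it by dx columns
    and dy rows, using nearest-neighbour sampling (assumed warp model)."""
    h, w = len(grid), len(grid[0])
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    out = [[fill] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            # Inverse mapping: undo the shift, then undo the rotation.
            x, y = c - dx - cx, r - dy - cy
            src_c = round(cos_t * x + sin_t * y + cx)
            src_r = round(-sin_t * x + cos_t * y + cy)
            if 0 <= src_r < h and 0 <= src_c < w:
                out[r][c] = grid[src_r][src_c]
    return out


def masked_cost(i_r, m_r, i_w, m_w):
    """Sum of squared differences over cells visible in both maps."""
    return sum(
        ((a - b) * ma * mb) ** 2
        for row_r, row_mr, row_w, row_mw in zip(i_r, m_r, i_w, m_w)
        for a, ma, b, mb in zip(row_r, row_mr, row_w, row_mw)
    )


def align(i_r, m_r, i_w, m_w, shifts, angles):
    """Exhaustively search candidate (dx, dy, theta) and return the
    lowest-cost tuple (cost, dx, dy, theta)."""
    best = None
    for theta in angles:
        for dy in shifts:
            for dx in shifts:
                # Warp the map and its mask together so occluded cells
                # and cells shifted in from outside the grid stay masked.
                cost = masked_cost(
                    i_r, m_r, warp(i_w, dx, dy, theta), warp(m_w, dx, dy, theta)
                )
                if best is None or cost < best[0]:
                    best = (cost, dx, dy, theta)
    return best
```

Because cells outside the joint visible region contribute nothing to the cost, a transform that leaves very little overlap can also score low; a practical search would presumably bound the search range or require a minimum overlap.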