Mark Johnson, Oscar Alejandro Mendez Maldonado, Avishkar Saha, R. Bowden, James A. Ross
Abstract: The ability to produce large-scale maps for navigation, path planning and other tasks is a crucial step for autonomous agents, but has always been challenging. In this work, we introduce BEV-SLAM, a novel type of graph-based SLAM that aligns semantically-segmented Bird's Eye View (BEV) predictions from monocular cameras. We introduce a novel form of occlusion reasoning into BEV estimation and demonstrate its importance to aid spatial aggregation of BEV predictions. The result is a versatile SLAM system that can operate across arbitrary multi-camera configurations and can be seamlessly integrated with other sensors. We show that the use of multiple cameras significantly increases performance, and achieves lower relative error than high-performance GPS. The resulting system is able to create large, dense, globally-consistent world maps from monocular cameras mounted around an ego vehicle. The maps are metric and correctly-scaled, making them suitable for downstream navigation tasks.
What problem does this paper attempt to address?
The paper addresses how to build a large-scale, globally consistent world map from monocular cameras for tasks such as navigation and path planning. Specifically, the authors introduce BEV-SLAM, a graph-based SLAM system that achieves this by aligning semantically segmented Bird's Eye View (BEV) predictions.
### Main problems
1. **Creating large-scale maps**: Mobile autonomous agents need an information-rich representation of the environment for navigation, planning, and localization, but building such a representation has always been a challenge.
2. **Limitations of monocular vision**: Monocular SLAM suffers from a lack of three-dimensional cues and the loss of metric scale.
3. **Integration of multi-camera configurations**: How to use multiple monocular cameras effectively to improve performance while keeping the results correct and consistent.
### Solutions
BEV-SLAM addresses these problems in the following ways:
- **Semantic segmentation and BEV prediction**: A convolutional neural network (CNN) maps each image directly to a semantically labeled BEV map, yielding a correctly scaled map that can be easily integrated with other sensors or maps.
- **Occlusion reasoning**: Introduce an occlusion reasoning mechanism to ensure temporal consistency, especially in occluded areas, which helps spatially aggregate BEV predictions.
- **Multi-camera support**: The system handles arbitrary multi-camera configurations, and using multiple cameras significantly improves performance and lowers relative error.
- **Global consistency**: The resulting large-scale dense map is metrically correct and suitable for downstream navigation tasks.
### Formula explanations
The formulas involved in the paper include:
- **Probability distribution**:
\[
P(x_k, m|Z_{0:k}, U_{0:k}, x_0)
\]
where \(x_k\) is the ego-vehicle pose at time \(k\), \(m\) is the BEV map, \(Z_{0:k}\) are the landmark observations, \(U_{0:k}\) are the map alignments, and \(x_0\) is the initial pose.
- **Dice loss function**:
\[
L_{\text{dice}} = 1-\frac{1}{|C|}\sum_{c = 1}^{|C|}\frac{2\sum_i\hat{t}_{ic}t_{ic}}{\sum_i\hat{t}_{ic}+\sum_i t_{ic}+\epsilon}
\]
where \(\hat{t}_{ic}\) is the ground truth, \(t_{ic}\) is the network prediction, \(i\) indexes BEV cells, \(c\) indexes the \(|C|\) semantic classes, and \(\epsilon\) is a small constant that prevents division by zero.
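To make the Dice loss concrete, here is a minimal pure-Python sketch of the formula as written above. The function name `dice_loss`, the nested-list layout (one row per cell, one column per class), and the default `eps` are assumptions made for illustration and are not taken from the paper; a real training pipeline would compute this on tensors in a deep-learning framework.

```python
def dice_loss(pred, target, eps=1e-6):
    """Class-averaged Dice loss.

    pred, target: lists over cells i, each a list over classes c,
    with values in [0, 1]. The layout is an illustrative assumption.
    """
    num_classes = len(pred[0])
    total = 0.0
    for c in range(num_classes):
        # Numerator: 2 * sum_i (target_ic * pred_ic)
        intersection = sum(p[c] * t[c] for p, t in zip(pred, target))
        # Denominator: sum_i target_ic + sum_i pred_ic + eps
        denom = sum(p[c] for p in pred) + sum(t[c] for t in target) + eps
        total += 2.0 * intersection / denom
    # One minus the mean per-class Dice coefficient
    return 1.0 - total / num_classes
```

The formula is symmetric in its two arguments, so swapping prediction and ground truth gives the same value. A perfect prediction yields a loss close to 0, and a prediction with no overlap yields a loss of 1.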
- **Optimal alignment**:
\[
(\Delta x^*, \Delta y^*, \Delta\theta^*)=\arg\min_{\Delta x, \Delta y, \Delta\theta}\left\|(i_r-i_w)\odot M_{\text{occr}}\odot M_{\text{occw}}\right\|_2^2
\]
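where \(i_r\) and \(M_{\text{occr}}\) are the reference BEV map and its binary occlusion mask, \(i_w\) and \(M_{\text{occw}}\) are the BEV map and occlusion mask being aligned after transformation by \((\Delta x, \Delta y, \Delta\theta)\), and \(\odot\) denotes element-wise multiplication, so that only cells visible in both maps contribute to the cost.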
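The masked alignment objective can be illustrated with a brute-force search over a small set of candidate transforms. This is only a sketch under stated assumptions: the helper names `warp`, `masked_cost`, and `best_alignment` are hypothetical, the maps are single-channel nested lists, warping uses nearest-neighbour sampling about the grid centre, and exhaustive grid search stands in for whatever optimiser the paper actually uses.

```python
import math

def warp(grid, dx, dy, theta, fill=0.0):
    """Rotate by theta (radians) about the grid centre, then shift by
    (dx, dy) cells, using nearest-neighbour sampling (an assumption)."""
    h, w = len(grid), len(grid[0])
    cx, cy = (w - 1) / 2.0, (h - 1) / 2.0
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    out = [[fill] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Inverse mapping: undo the shift, then undo the rotation
            xs, ys = x - dx - cx, y - dy - cy
            u = int(round(cos_t * xs + sin_t * ys + cx))
            v = int(round(-sin_t * xs + cos_t * ys + cy))
            if 0 <= u < w and 0 <= v < h:
                out[y][x] = grid[v][u]
    return out

def masked_cost(ref, ref_mask, src, src_mask, dx, dy, theta):
    """Squared L2 norm of (i_r - i_w) * M_occr * M_occw."""
    warped = warp(src, dx, dy, theta)
    # Cells warped in from outside the grid get mask 0 (not observed)
    warped_mask = warp(src_mask, dx, dy, theta)
    cost = 0.0
    for y in range(len(ref)):
        for x in range(len(ref[0])):
            m = ref_mask[y][x] * warped_mask[y][x]
            cost += ((ref[y][x] - warped[y][x]) * m) ** 2
    return cost

def best_alignment(ref, ref_mask, src, src_mask, shifts, thetas):
    """Exhaustively search candidate (dx, dy, theta) triples."""
    candidates = [(dx, dy, th) for dx in shifts for dy in shifts for th in thetas]
    return min(candidates,
               key=lambda c: masked_cost(ref, ref_mask, src, src_mask, *c))
```

A caveat of this sketch: because masked cells contribute nothing, the raw cost can only shrink as the overlap between the two masks shrinks, so the search range should be bounded, or the cost normalised by the number of jointly visible cells, to avoid degenerate low-overlap solutions.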
Together, these components allow BEV-SLAM to build high-quality, globally consistent maps in complex environments that are suitable for a range of navigation and planning tasks.