Abstract: The ability to produce large-scale maps for navigation, path planning and other tasks is a crucial step for autonomous agents, but has always been challenging. In this work, we introduce BEV-SLAM, a novel type of graph-based SLAM that aligns semantically-segmented Bird's Eye View (BEV) predictions from monocular cameras. We introduce a novel form of occlusion reasoning into BEV estimation and demonstrate its importance to aid spatial aggregation of BEV predictions. The result is a versatile SLAM system that can operate across arbitrary multi-camera configurations and can be seamlessly integrated with other sensors. We show that the use of multiple cameras significantly increases performance, and achieves lower relative error than high-performance GPS. The resulting system is able to create large, dense, globally-consistent world maps from monocular cameras mounted around an ego vehicle. The maps are metric and correctly-scaled, making them suitable for downstream navigation tasks.

What problem does this paper attempt to address?

This paper addresses how to use monocular cameras to create a large-scale, globally consistent world map for tasks such as navigation and path planning. Specifically, the researchers introduce a novel graph-optimization SLAM system named BEV-SLAM, which achieves this goal by aligning semantically segmented Bird's Eye View (BEV) predictions.

### Main problems

1. **Creating large-scale maps**: Mobile autonomous agents need an information-rich representation of the environment for navigation, planning, and localization, but creating such a representation has always been a challenge.
2. **Limitations of monocular vision**: Monocular SLAM lacks direct three-dimensional cues and cannot recover metric scale on its own.
3. **Integration of multi-camera configurations**: How to use multiple monocular cameras effectively to improve performance while keeping the results correct and consistent.

### Solutions

BEV-SLAM addresses these problems in the following ways:

- **Semantic segmentation and BEV prediction**: A convolutional neural network (CNN) maps directly from an image to a semantically labeled BEV map, yielding a properly scaled map that can be easily integrated with other sensors or maps.
- **Occlusion reasoning**: An occlusion reasoning mechanism ensures temporal consistency, especially in occluded areas, which helps to spatially aggregate BEV predictions.
- **Multi-camera support**: The system can handle any multi-camera configuration; using multiple cameras significantly improves performance and reduces relative error.
- **Global consistency**: The generated large-scale dense map is metrically correct and suitable for downstream navigation tasks.

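
As a rough illustration of why occlusion reasoning helps spatial aggregation, the sketch below fuses several BEV predictions that are assumed to be already aligned in a common map frame, counting only the cells each frame marks as visible. The function name `aggregate_bev`, the array layout, and the simple averaging rule are assumptions made for illustration; this summary does not specify the paper's actual aggregation scheme.

```python
import numpy as np


def aggregate_bev(probs, visible):
    """Fuse aligned BEV predictions while ignoring occluded cells.

    probs:   (T, C, H, W) per-frame class probabilities, assumed to be
             already warped into a common map frame.
    visible: (T, H, W) binary masks, 1 where a cell was observed and
             0 where it was occluded (hypothetical convention).
    Returns (labels, observed): an (H, W) integer label map with -1 in
    cells that no frame observed, and the (H, W) boolean observed mask.
    """
    probs = np.asarray(probs, dtype=float)
    visible = np.asarray(visible, dtype=float)

    # Sum class probabilities over frames, counting visible cells only.
    weighted = (probs * visible[:, None, :, :]).sum(axis=0)  # (C, H, W)
    counts = visible.sum(axis=0)                             # (H, W)
    observed = counts > 0

    # Average over the frames that actually saw each cell.
    mean = np.zeros_like(weighted)
    np.divide(weighted, counts[None], out=mean, where=observed[None])

    labels = np.where(observed, mean.argmax(axis=0), -1)
    return labels, observed
```

With this rule, a confident prediction made for an occluded cell contributes nothing to the fused map, so guesses about unseen regions cannot overwrite cells that another frame actually observed.
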
### Formula explanations

The formulas involved in the paper include:

- **Probability distribution**: \[ P(x_k, m \mid Z_{0:k}, U_{0:k}, x_0) \] where \(x_k\) is the ego-vehicle pose, \(m\) is the BEV map, \(Z\) denotes the landmark observations, and \(U\) denotes the map alignments.
- **Dice loss function**: \[ L_{\text{dice}} = 1-\frac{1}{|C|}\sum_{c = 1}^{|C|}\frac{2\sum_i\hat{t}_{ic}t_{ic}}{\sum_i\hat{t}_{ic}+\sum_i t_{ic}+\epsilon} \] where \(\hat{t}_{ic}\) is the ground truth, \(t_{ic}\) is the network prediction, and \(\epsilon\) is a small constant that prevents division by zero.
- **Optimal alignment**: \[ (\Delta x^*, \Delta y^*, \Delta\theta^*)=\arg\min_{\Delta x, \Delta y, \Delta\theta}\left\|(i_r-i_w)\odot M_{\text{occr}}\odot M_{\text{occw}}\right\|_2^2 \] where \(i_r\) and \(M_{\text{occr}}\) are the reference BEV map and its corresponding binary occlusion mask, and \(i_w\) and \(M_{\text{occw}}\) are the BEV map and occlusion mask warped by \((\Delta x, \Delta y, \Delta\theta)\).

Through these methods, BEV-SLAM can create high-quality, globally consistent maps in complex environments, suitable for a range of navigation and planning tasks.
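
The Dice loss and the masked alignment objective above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the helper names (`dice_loss`, `shift2d`, `masked_cost`, `best_translation`) are hypothetical, the search covers integer translations only and omits the rotation \(\Delta\theta\), and the cost is divided by the number of jointly visible cells so that shifts with little overlap are not trivially preferred.

```python
import numpy as np


def dice_loss(pred, target, eps=1e-6):
    """Class-averaged Dice loss for (C, N) prediction and target arrays."""
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    inter = 2.0 * (pred * target).sum(axis=1)
    denom = pred.sum(axis=1) + target.sum(axis=1) + eps
    return 1.0 - (inter / denom).mean()


def shift2d(a, dy, dx):
    """Shift a 2-D array by (dy, dx) cells, zero-filling vacated cells."""
    out = np.zeros_like(a)
    h, w = a.shape
    if abs(dy) >= h or abs(dx) >= w:
        return out  # shifted completely out of view
    src = a[max(0, -dy):h - max(0, dy), max(0, -dx):w - max(0, dx)]
    y0, x0 = max(0, dy), max(0, dx)
    out[y0:y0 + src.shape[0], x0:x0 + src.shape[1]] = src
    return out


def masked_cost(i_r, i_w, m_r, m_w):
    """Squared L2 difference over cells visible in both maps.

    Returns (cost, overlap), where overlap is the number of jointly
    visible cells.
    """
    joint = m_r * m_w
    cost = (((i_r - i_w) * joint) ** 2).sum()
    return float(cost), float(joint.sum())


def best_translation(i_r, m_r, i_src, m_src, radius=3):
    """Exhaustive search over integer shifts (rotation omitted).

    Assumption: the cost is normalised by the overlap, and shifts with
    no overlap are skipped. Returns (score, dy, dx), or None if no
    candidate shift overlaps the reference.
    """
    best = None
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            i_w = shift2d(i_src, dy, dx)
            m_w = shift2d(m_src, dy, dx)
            cost, overlap = masked_cost(i_r, i_w, m_r, m_w)
            if overlap == 0:
                continue
            score = cost / overlap
            if best is None or score < best[0]:
                best = (score, dy, dx)
    return best
```

A practical system would more likely use a continuous optimiser or a coarse-to-fine search that also covers rotation; the brute-force loop is used here only because it makes the role of the two occlusion masks easy to see.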

BEV-SLAM: Building a Globally-Consistent World Map Using Monocular Vision

From Satellite to Ground: Satellite Assisted Visual Localization with Cross-view Semantic Matching

Monocular SLAM for Large Scale Scenes

Orbeez-SLAM: A Real-time Monocular Visual SLAM with ORB Features and NeRF-realized Mapping

BirdSLAM: Monocular Multibody SLAM in Bird's-Eye View

SBC-SLAM: Semantic Bioinspired Collaborative SLAM for Large-Scale Environment Perception of Heterogeneous Systems

Bifocal-Binocular Visual SLAM System for Repetitive Large-Scale Environments

Fusion of Monocular Vision and Radio-based Ranging for Global Scale Estimation and Drift Mitigation

Large-Scale Monocular Slam By Local Bundle Adjustment And Map Joining

BLOS-BEV: Navigation Map Enhanced Lane Segmentation Network, Beyond Line of Sight

Monocular Vision SLAM for Large Scale Outdoor Environment

BEV-Seg: Bird's Eye View Semantic Segmentation Using Geometry and Semantic Point Cloud

Multi-camera visual SLAM for autonomous navigation of micro aerial vehicles

Understanding Bird's-Eye View of Road Semantics using an Onboard Camera

BEVFormer: Learning Bird's-Eye-View Representation from Multi-Camera Images via Spatiotemporal Transformers

Tightly-Coupled LiDAR-Visual-Inertial SLAM and Large-Scale Volumetric Occupancy Mapping

LetsMap: Unsupervised Representation Learning for Semantic BEV Mapping

Unified framework for recognition, localization and mapping using wearable cameras

AVM-SLAM: Semantic Visual SLAM with Multi-Sensor Fusion in a Bird's Eye View for Automated Valet Parking

Multicam-SLAM: Non-overlapping Multi-camera SLAM for Indirect Visual Localization and Navigation

A Monocular Visual SLAM System Augmented by Lightweight Deep Local Feature Extractor Using In-House and Low-Cost LIDAR-camera Integrated Device