Abstract: We introduce a new system for Multi-Session SLAM, which tracks camera motion across multiple disjoint videos under a single global reference. Our approach couples the prediction of optical flow with solver layers to estimate camera pose. The backbone is trained end-to-end using a novel differentiable solver for wide-baseline two-view pose. The full system can connect disjoint sequences, perform visual odometry, and global optimization. Compared to existing approaches, our design is accurate and robust to catastrophic failures. Code is available at
What problem does this paper attempt to address?
This paper addresses Multi-Session Simultaneous Localization and Mapping (Multi-Session SLAM). Specifically, the researchers propose a new system that tracks camera motion across multiple discontinuous video streams and registers them in a single global reference frame. The key issues described in the paper are:
1. **Challenges of Multi-Session SLAM**:
- Traditional SLAM tasks assume that the input is a continuous video stream.
- In practice, however, video data often consist of multiple discontinuous segments. The split may be intentional (such as collaborative 3D reconstruction) or caused by visual discontinuities in the stream (such as camera failure, extreme parallax, sharp turns, automatic exposure delay, dark areas, or severe occlusion by dynamic objects).
2. **Limitations of Existing Methods**:
- Most existing Multi-Session SLAM methods rely on additional sensor data to eliminate the scale degree of freedom and simplify tracking.
- Only a few methods (such as CCM-SLAM and ORB-SLAM3) support Multi-Session SLAM using only monocular RGB video, but these methods are based on classical feature descriptors and have lower average accuracy.
- Other deep-learning methods (such as DROID-SLAM and DPVO) perform well on a single continuous video but cannot handle large-baseline matching or non-local optimization, so they are not suitable for Multi-Session SLAM.
3. **The Method Proposed in This Paper**:
- A new differentiable solver layer is introduced that estimates the camera pose by minimizing the Symmetric Epipolar Distance (SED) of bidirectional optical flow.
- A unified backbone architecture is proposed that handles both large-baseline relative pose estimation and visual odometry.
- By iteratively updating the optical flow and camera pose, the method can connect multiple discontinuous video streams, perform visual odometry, and run global optimization.
4. **Experimental Verification**:
- The system was evaluated on challenging real-world datasets such as EuRoC MAV and ETH3D, where the results show it to be more accurate and robust than existing methods.
- The two-view pose estimation method was evaluated separately on the ScanNet and MegaDepth datasets, where its performance is comparable to that of Transformer-based matching networks, especially when matching widely separated views.
In conclusion, this paper targets the key problem in Multi-Session SLAM: how to perform accurate camera pose estimation and global optimization across multiple discontinuous video segments.
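
To make the Symmetric Epipolar Distance mentioned in point 3 more concrete, the following NumPy sketch evaluates it for a candidate relative pose and a set of point correspondences. It is only an illustration under stated assumptions, not the paper's solver: the function names `skew` and `symmetric_epipolar_distance` are hypothetical, points are assumed to be in normalized (calibrated) image coordinates, the pose convention is assumed to be X2 = R X1 + t, and the SED is taken here as the sum of the two point-to-epipolar-line distances (other formulations use squared distances). The paper's layer is differentiable and operates on bidirectional optical flow; this sketch only computes the residual that such a solver would minimize.

```python
import numpy as np

def skew(t):
    """Cross-product matrix [t]_x, so that skew(t) @ v == np.cross(t, v)."""
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])

def symmetric_epipolar_distance(R, t, x1, x2, eps=1e-12):
    """Per-correspondence SED for an assumed pose convention X2 = R @ X1 + t.

    x1, x2: (N, 2) arrays of matched points in normalized image coordinates.
    Returns an (N,) array: distance of x2 to its epipolar line in image 2
    plus distance of x1 to its epipolar line in image 1.
    """
    E = skew(t) @ R                                  # essential matrix
    h1 = np.hstack([x1, np.ones((len(x1), 1))])      # homogeneous points
    h2 = np.hstack([x2, np.ones((len(x2), 1))])
    l2 = h1 @ E.T                                    # rows are E @ x1 (lines in image 2)
    l1 = h2 @ E                                      # rows are E.T @ x2 (lines in image 1)
    r = np.abs(np.sum(h2 * l2, axis=1))              # |x2^T E x1|
    d2 = r / (np.linalg.norm(l2[:, :2], axis=1) + eps)
    d1 = r / (np.linalg.norm(l1[:, :2], axis=1) + eps)
    return d1 + d2

# Synthetic example: 3D points seen from two cameras related by (R, t).
rng = np.random.default_rng(0)
X1 = rng.uniform([-1.0, -1.0, 4.0], [1.0, 1.0, 8.0], size=(5, 3))
angle = 0.2                                          # rotation about the y axis
R = np.array([[np.cos(angle), 0.0, np.sin(angle)],
              [0.0, 1.0, 0.0],
              [-np.sin(angle), 0.0, np.cos(angle)]])
t = np.array([1.0, 0.0, 0.1])
X2 = X1 @ R.T + t
x1 = X1[:, :2] / X1[:, 2:]                           # project into image 1
x2 = X2[:, :2] / X2[:, 2:]                           # project into image 2

print(symmetric_epipolar_distance(R, t, x1, x2))         # near zero for the true pose
print(symmetric_epipolar_distance(np.eye(3), t, x1, x2)) # larger for a wrong rotation
```

For exact correspondences and the true pose, the distances vanish up to floating-point error, while a wrong pose or noisy matches yield positive values. That property is what makes the SED usable as an objective for pose estimation.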