Abstract: Recent deep learning based visual simultaneous localization and mapping (SLAM) methods have made significant progress. However, how to make full use of visual information and how to better integrate an inertial measurement unit (IMU) in visual SLAM remain open research questions. This paper proposes a novel deep SLAM network with dual visual factors. The basic idea is to integrate both a photometric factor and a re-projection factor into an end-to-end differentiable structure through a multi-factor data association module. We show that the proposed network dynamically learns and adjusts the confidence maps of both visual factors and that it can be further extended to include IMU factors as well. Extensive experiments validate that our proposed method significantly outperforms the state-of-the-art methods on several public datasets, including TartanAir, EuRoC and ETH3D-SLAM. Specifically, when dynamically fusing the three factors together, the absolute trajectory error on the EuRoC dataset is reduced by 45.3% and 36.2% for the monocular and stereo configurations, respectively.

What problem does this paper attempt to address?

This paper addresses how to make full use of visual information in visual simultaneous localization and mapping (SLAM) and how to better integrate an inertial measurement unit (IMU). Specifically, the authors propose a new deep-learning framework named DVI-SLAM (Dual Visual Inertial SLAM network) with the following goals:

1. **Make full use of visual information**: By introducing dual visual factors (a feature-metric factor and a reprojection factor), the visual information in the image can be used more comprehensively.
2. **Effectively fuse IMU information**: Tightly couple IMU data with visual information to improve the robustness and accuracy of the system in fast-motion scenarios.
3. **End-to-end differentiable structure**: Design a multi-factor data-association module so that the network can dynamically learn and adjust the confidence maps of the different factors, thereby achieving end-to-end optimization.

### Main contributions of the paper

- **Propose the DVI-SLAM network**: This network significantly improves camera pose estimation accuracy in complex scenes by learning confidence maps and dynamically fusing the reprojection, feature-metric and IMU factors.
- **Flexibly support multiple sensor configurations**: The framework adapts to monocular or stereo cameras, with or without a depth sensor or an IMU.
- **Surpass existing methods**: On multiple public datasets (TartanAir, EuRoC and ETH3D-SLAM), DVI-SLAM outperforms the current state-of-the-art methods.

### Formula summary

The formulas involved in the paper include, but are not limited to, the following:

1. **Reprojection residual**:
\[ E_r(T,d)=\|x^*_{ij}-\Pi(T_i,T_j,d_i,x_i)\|_{\Sigma_r}^2,\quad \Sigma_r=\text{diag}(w_r) \]
where \(\|\cdot\|_{\Sigma_r}\) represents the Mahalanobis distance, and \(w_r\) is the reprojection confidence map.

2. **Feature-metric residual**:
\[ E_f(T,d)=\|F_a(x_i)-F_a(\Pi(T_i,T_j,d_i,x_i))\|_{\Sigma_f}^2,\quad \Sigma_f=\text{diag}(w_f) \]

3. **Inertial residual**:
\[ E_u(T,M)=\|e_T + e_M\|^2_{\Sigma_u},\quad \Sigma_u=\text{diag}(w_{imu}) \]
where
\[ e_T=\left[ \begin{array}{c} \log((\Delta R_{ij}\exp(\Delta J_r(b_{gi}-\hat{b}_{gi})))^T R_i^T R_j)\\ R_i^T(p_j - p_i - v_i\Delta t_{ij}-\frac{1}{2}g \Delta t_{ij}^2)-(\Delta p_{ij}+\Delta J_p(b_i-\hat{b}_i)) \end{array} \right] \]
\[ e_M=\left[ \begin{array}{c} R_i^T(v_j - v_i - g \Delta t_{ij})-(\Delta v_{ij}+\Delta J_v(b_i-\hat{b}_i))\\ b_{aj}-b_{ai}\\ b_{gj}-b_{gi} \end{array} \right] \]
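
As an illustration of the reprojection residual above, the following is a minimal Python sketch for a single pixel, not the paper's implementation. It rests on several assumptions that the summary does not confirm: a pinhole camera model, \(d_i\) treated as a depth rather than an inverse depth, the two poses \(T_i, T_j\) already composed into a relative rotation `R_ij` and translation `t_ij`, and the confidence \(w_r\) applied as a per-component weight on the squared residual. All function names are hypothetical.

```python
def backproject(pixel, depth, fx, fy, cx, cy):
    """Lift a pixel with known depth to a 3D point in the camera frame."""
    u, v = pixel
    return ((u - cx) * depth / fx, (v - cy) * depth / fy, depth)


def transform(R, t, p):
    """Apply the rigid motion p -> R p + t (R given as 3 rows of 3 floats)."""
    return tuple(sum(R[r][c] * p[c] for c in range(3)) + t[r] for r in range(3))


def project(point, fx, fy, cx, cy):
    """Pinhole projection of a camera-frame point to pixel coordinates."""
    x, y, z = point
    return (fx * x / z + cx, fy * y / z + cy)


def reprojection_residual(x_star, x_i, depth_i, R_ij, t_ij, intrinsics):
    """Residual x*_ij - Pi(...): target correspondence minus reprojected pixel.

    Assumption: R_ij, t_ij map points from frame i to frame j.
    """
    p_i = backproject(x_i, depth_i, *intrinsics)
    p_j = transform(R_ij, t_ij, p_i)
    u, v = project(p_j, *intrinsics)
    return (x_star[0] - u, x_star[1] - v)


def weighted_cost(residual, confidence):
    """Confidence-weighted squared norm: sum of w_k * r_k^2 (assumed weighting)."""
    return sum(w * r * r for r, w in zip(residual, confidence))
```

In this sketch a pixel with low confidence contributes little to the total cost, which is one plausible reading of how learned confidence maps let the optimizer down-weight unreliable correspondences.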

DVI-SLAM: A Dual Visual Inertial SLAM Network

DiT-SLAM: Real-Time Dense Visual-Inertial SLAM with Implicit Depth Representation and Tightly-Coupled Graph Optimization

DVT-SLAM: Deep-Learning Based Visible and Thermal Fusion SLAM

InertialNet: Toward Robust SLAM Via Visual Inertial Measurement.

DXSLAM: A Robust and Efficient Visual SLAM System with Deep Features.

HVL-SLAM: Hybrid Vision and LiDAR Fusion for SLAM

Visual-LiDAR SLAM Based on Unsupervised Multi-channel Deep Neural Networks

A Real-Time Dynamic SLAM Algorithm Based on the Fusion of Visual, Inertial, and Semantic Information

Design of visual inertial state estimator for autonomous systems via multi-sensor fusion approach

LDVI-SLAM: A Lightweight Monocular Visual-Inertial SLAM System for Dynamic Environments Based on Motion Constraints

An improved SLAM based on RK-VIF: Vision and inertial information fusion via Runge-Kutta method

A Monocular Visual SLAM System Augmented by Lightweight Deep Local Feature Extractor Using In-House and Low-Cost LIDAR-camera Integrated Device

A real-time, robust and versatile visual-SLAM framework based on deep learning networks

LIFT-SLAM: A deep-learning feature-based monocular visual SLAM method

DLD-SLAM: RGB-D Visual Simultaneous Localisation and Mapping in Indoor Dynamic Environments Based on Deep Learning

A Multisensor Fusion With Automatic Vision–LiDAR Calibration Based on Factor Graph Joint Optimization for SLAM

PLE-SLAM: A Visual-Inertial SLAM Based on Point-Line Features and Efficient IMU Initialization

Stereo Vision SLAM Based on Feature Extraction Network

Visual Inertial SLAM Based on Spatiotemporal Consistency Optimization in Diverse Environments

DM-SLAM: A Feature-Based SLAM System for Rigid Dynamic Scenes

Sensor Fusion SLAM: An Efficient and Robust SLAM system for Dynamic Environments