DVI-SLAM: A Dual Visual Inertial SLAM Network

Xiongfeng Peng, Zhihua Liu, Weiming Li, Ping Tan, SoonYong Cho, Qiang Wang
2024-05-26
Abstract: Recent deep learning based visual simultaneous localization and mapping (SLAM) methods have made significant progress. However, how to make full use of visual information and better integrate an inertial measurement unit (IMU) into visual SLAM remains an open question with research value. This paper proposes a novel deep SLAM network with dual visual factors. The basic idea is to integrate both a photometric factor and a re-projection factor into the end-to-end differentiable structure through a multi-factor data association module. We show that the proposed network dynamically learns and adjusts the confidence maps of both visual factors and that it can be further extended to include IMU factors as well. Extensive experiments validate that our proposed method significantly outperforms the state-of-the-art methods on several public datasets, including TartanAir, EuRoC and ETH3D-SLAM. Specifically, when dynamically fusing the three factors together, the absolute trajectory error on the EuRoC dataset is reduced by 45.3% and 36.2% for the monocular and stereo configurations, respectively.
Computer Vision and Pattern Recognition
### What problems does this paper attempt to solve?

This paper addresses how to make full use of visual information in visual simultaneous localization and mapping (VSLAM) and how to better integrate an inertial measurement unit (IMU). Specifically, the authors propose a new deep-learning framework named DVI-SLAM (Dual Visual Inertial SLAM network) with the following goals:

1. **Make full use of visual information**: Introducing dual visual factors (a feature-metric factor and a reprojection factor) lets the system use the visual information in the image more comprehensively.
2. **Effectively fuse IMU information**: Tightly integrate IMU data with visual information to improve the robustness and accuracy of the system in fast-motion scenarios.
3. **End-to-end differentiable structure**: Design a multi-factor data association module so that the network can dynamically learn and adjust the confidence maps of the different factors, thereby achieving end-to-end optimization.

### Main contributions of the paper

- **Propose the DVI-SLAM network**: The network significantly improves camera pose estimation accuracy in complex scenes by learning confidence maps and dynamically fusing the reprojection, feature-metric and IMU factors.
- **Flexibly support multiple sensor configurations**: The framework adapts to monocular or stereo cameras, with or without a depth sensor or an IMU.
- **Surpass existing methods**: On multiple public datasets (TartanAir, EuRoC and ETH3D-SLAM), DVI-SLAM outperforms the current state-of-the-art methods.

### Formula summary

The formulas in the paper include, but are not limited to, the following:

1. **Reprojection residual**:
   \[ E_r(T,d)=\|x^*_{ij}-\Pi(T_i,T_j,d_i,x_i)\|_{P_r}^2,\quad P_r=\text{diag}(w_r) \]
   where \(\|\cdot\|_{P_r}\) denotes the Mahalanobis distance and \(w_r\) is the reprojection confidence map.
2. **Feature-metric residual**:
   \[ E_f(T,d)=\|F_a(x_i)-F_a(\Pi(T_i,T_j,d_i,x_i))\|_{P_f}^2,\quad P_f=\text{diag}(w_f) \]
   where \(w_f\) is the feature-metric confidence map.
3. **Inertial residual**:
   \[ E_u(T,M)=\|e_T + e_M\|^2_{P_u},\quad P_u=\text{diag}(w_{imu}) \]
   where
   \[ e_T=\begin{bmatrix} \log\left(\left(\Delta R_{ij}\exp\left(\Delta J_r(b_{gi}-\hat{b}_{gi})\right)\right)^T R_i^T R_j\right)\\ R_i^T\left(p_j - p_i - v_i\Delta t_{ij}-\frac{1}{2}g\,\Delta t_{ij}^2\right)-\left(\Delta p_{ij}+\Delta J_p(b_i-\hat{b}_i)\right) \end{bmatrix} \]
   \[ e_M=\begin{bmatrix} R_i^T\left(v_j - v_i - g\,\Delta t_{ij}\right)-\left(\Delta v_{ij}+\Delta J_v(b_i-\hat{b}_i)\right)\\ b_{aj}-b_{ai}\\ b_{gj}-b_{gi} \end{bmatrix} \]
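
The confidence-weighted form shared by the residuals above can be illustrated with a small NumPy sketch. It is not the authors' implementation. It assumes that the diagonal matrix diag(w) multiplies the squared residuals directly, so that a higher confidence gives a larger weight, and that the visual energies are fused by simple summation. The function names `weighted_energy` and `reprojection_residual` and all toy values are hypothetical.

```python
import numpy as np


def weighted_energy(residual, confidence):
    """Return r^T diag(w) r, i.e. sum_k w_k * r_k^2 (assumed weighting convention)."""
    r = np.asarray(residual, dtype=float).ravel()
    w = np.asarray(confidence, dtype=float).ravel()
    if r.shape != w.shape:
        raise ValueError("residual and confidence must have the same number of elements")
    return float(np.sum(w * r * r))


def reprojection_residual(x_target, x_projected):
    """Per-point 2D residual x*_ij - Pi(...), with shape (N, 2)."""
    return np.asarray(x_target, dtype=float) - np.asarray(x_projected, dtype=float)


# Toy reprojection data: two points with (u, v) coordinates (hypothetical values).
x_target = np.array([[10.0, 5.0], [3.0, 4.0]])
x_projected = np.array([[9.0, 5.0], [3.0, 2.0]])
w_r = np.array([[1.0, 1.0], [0.5, 0.5]])  # reprojection confidence map

E_r = weighted_energy(reprojection_residual(x_target, x_projected), w_r)

# Toy feature-metric data: feature values sampled at x_i and at the projected location.
feat_i = np.array([0.2, 0.8, 0.5])
feat_j = np.array([0.1, 0.8, 0.9])
w_f = np.array([1.0, 1.0, 0.0])  # zero confidence removes the third element

E_f = weighted_energy(feat_i - feat_j, w_f)

# Assumed fusion rule: add the weighted energies of the two visual factors.
E_total = E_r + E_f
print(E_r, E_f, E_total)
```

In this toy case the second reprojection point contributes with half weight, and the third feature element is ignored entirely because its confidence is zero, which is the mechanism that lets learned confidence maps down-weight unreliable measurements.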