FSNet: Redesign Self-Supervised MonoDepth for Full-Scale Depth Prediction for Autonomous Driving

Yuxuan Liu, Zhenhua Xu, Huaiyang Huang, Lujia Wang, Ming Liu
2023-04-21
Abstract: Predicting accurate depth with monocular images is important for low-cost robotic applications and autonomous driving. This study proposes a comprehensive self-supervised framework for accurate scale-aware depth prediction on autonomous driving scenes utilizing inter-frame poses obtained from inertial measurements. In particular, we introduce a Full-Scale depth prediction network named FSNet. FSNet contains four important improvements over existing self-supervised models: (1) a multichannel output representation for stable training of depth prediction in driving scenarios, (2) an optical-flow-based mask designed for dynamic object removal, (3) a self-distillation training strategy to augment the training process, and (4) an optimization-based post-processing algorithm in test time, fusing the results from visual odometry. With this framework, robots and vehicles with only one well-calibrated camera can collect sequences of training image frames and camera poses, and infer accurate 3D depths of the environment without extra labeling work or 3D data. Extensive experiments on the KITTI dataset, KITTI-360 dataset and the nuScenes dataset demonstrate the potential of FSNet. More visualizations are presented at https://sites.google.com/view/fsnet/home
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses the scale ambiguity that arises when depth is predicted from monocular images in autonomous driving scenes. Existing self-supervised monocular depth prediction methods, such as Monodepth2, can only predict relative depth and cannot provide depth values with the correct global scale. This limits their usefulness in applications that require accurate 3D perception, such as robot deployment and autonomous driving.

To solve this problem, the authors propose FSNet, a comprehensive self-supervised framework that uses inter-frame poses obtained from an Inertial Measurement Unit (IMU) to predict depth with the correct scale. FSNet makes four improvements over existing models:

1. **Multichannel output representation**: stabilizes the training of the depth prediction network in driving scenarios.
2. **Optical-flow-based mask**: removes dynamic objects.
3. **Self-distillation training strategy**: augments the training process.
4. **Optimization-based post-processing algorithm**: fuses the results of Visual Odometry (VO) at test time to improve performance.

With these improvements, a robot or vehicle with only one well-calibrated camera can collect training image sequences and camera poses and infer accurate 3D depth of the environment, without additional annotation work or 3D data.

### Formula Summary

1. **Depth Decoding Formula**:
   \[ d=\frac{1}{\frac{1}{d_{\text{max}}}+\sigma(x)\left(\frac{1}{d_{\text{min}}}-\frac{1}{d_{\text{max}}}\right)} \]
   where \(d_{\text{max}} = 100\) and \(d_{\text{min}} = 0.1\) are the boundary parameters for depth prediction.
2. **Photometric Loss Function**:
   \[ l_{\text{photo}}(d_0)=\alpha\,\frac{1 - \text{SSIM}(I_0, I^l_0)}{2}+\beta\,|I_0 - I^l_0| \]
   where \(\alpha = 0.85\) and \(\beta = 0.15\).
3. **Initial Depth Mean Formula**:
   \[ \lim_{N\rightarrow\infty}d'=\frac{d_{\text{max}}-d_{\text{min}}}{\ln d_{\text{max}}-\ln d_{\text{min}}} \]
4. **Point-to-Epipolar Line Distance Formula**:
   \[ \text{dis}_l=\frac{L\cdot[dx, dy, 1]^T}{\sqrt{L_0^2 + L_1^2}} \]
5. **Log-Depth Distribution Formula**:
   \[ p(\log(d)|\log(d_{\text{pseudo}}),\sigma)=-\frac{|\log(d)-\log(d_{\text{pseudo}})|}{\sigma}-\log(\sigma) \]
6. **Optimization Problem Formula**:
   \[ \min_{D_{\text{out}}}\sum_i L_{d_i}^{\text{out}} \]
   where
   \[ L_{d_i}=\lambda_0 L_i^{\text{consist}}+\lambda_i^1 L_i^{\text{vo}} \]
   \[ L_i^{\text{consist}}=\sum_j(\log(d_i^0)-\log(d_j^0)-