Abstract: Predicting accurate depth with monocular images is important for low-cost robotic applications and autonomous driving. This study proposes a comprehensive self-supervised framework for accurate scale-aware depth prediction on autonomous driving scenes utilizing inter-frame poses obtained from inertial measurements. In particular, we introduce a Full-Scale depth prediction network named FSNet. FSNet contains four important improvements over existing self-supervised models: (1) a multichannel output representation for stable training of depth prediction in driving scenarios, (2) an optical-flow-based mask designed for dynamic object removal, (3) a self-distillation training strategy to augment the training process, and (4) an optimization-based post-processing algorithm in test time, fusing the results from visual odometry. With this framework, robots and vehicles with only one well-calibrated camera can collect sequences of training image frames and camera poses, and infer accurate 3D depths of the environment without extra labeling work or 3D data. Extensive experiments on the KITTI dataset, KITTI-360 dataset and the nuScenes dataset demonstrate the potential of FSNet. More visualizations are presented at https://sites.google.com/view/fsnet/home.

What problem does this paper attempt to address?

This paper addresses the scale ambiguity that arises when predicting depth from monocular images in autonomous driving scenes. Existing self-supervised monocular depth prediction methods, such as Monodepth2, predict only relative depth and cannot provide depth values with the correct global scale. This limits their usefulness in applications that require accurate 3D perception, such as robot deployment and autonomous driving.

To solve this problem, the authors propose FSNet, a comprehensive self-supervised framework that uses inter-frame poses obtained from an Inertial Measurement Unit (IMU) to predict depth with the correct scale. FSNet makes four improvements over existing models:

1. **Multi-channel output representation**: stabilizes the training of the depth prediction network in driving scenarios.
2. **Optical-flow-based mask**: removes dynamic objects.
3. **Self-distillation training strategy**: augments the training process.
4. **Optimization-based post-processing algorithm**: fuses the results of Visual Odometry (VO) at test time to improve performance.

With these improvements, a robot or vehicle with only one well-calibrated camera can collect training image sequences and camera poses and then infer accurate 3D depth of the environment, without additional annotation work or 3D data.

### Formula Summary

1. **Depth Decoding Formula**:
\[ d=\frac{1}{\frac{1}{d_{\text{max}}}+\sigma(x)\left(\frac{1}{d_{\text{min}}}-\frac{1}{d_{\text{max}}}\right)} \]
where \(\sigma\) is the sigmoid function applied to the network output \(x\), and \(d_{\text{max}} = 100\) and \(d_{\text{min}} = 0.1\) are the boundary parameters for depth prediction.

2. **Photometric Loss Function**:
\[ l_{\text{photo}}(d_0)=\alpha\left(1 - \frac{\text{SSIM}(I_0, I^l_0)}{2}\right)+\beta|I_0 - I^l_0| \]
where \(\alpha = 0.85\) and \(\beta = 0.15\).

3. **Initial Depth Mean Formula**:
\[ \lim_{N\rightarrow\infty}d'=\frac{d_{\text{max}}-d_{\text{min}}}{\ln d_{\text{max}}-\ln d_{\text{min}}} \]

4. **Point-to-Epipolar-Line Distance Formula**:
\[ \text{dis}_l=\frac{L\cdot[dx, dy, 1]^T}{\sqrt{L_0^2 + L_1^2}} \]

5. **Log-Depth Distribution Formula**:
\[ p(\log(d)|\log(d_{\text{pseudo}}),\sigma)=-\frac{|\log(d)-\log(d_{\text{pseudo}})|}{\sigma}-\log(\sigma) \]

6. **Optimization Problem Formula**:
\[ \operatorname*{minimize}_{D_{\text{out}}}\sum_i L_{d_i}^{\text{out}} \]
where
\[ L_{d_i}=\lambda_0 L_i^{\text{consist}}+\lambda_i^1 L_i^{\text{vo}} \]
\[ L_i^{\text{consist}}=\sum_j(\log(d_i^0)-\log(d_j^0)-
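
The depth decoding, initial depth mean, and point-to-epipolar-line formulas above can be checked numerically with a short sketch. This is an illustration under stated assumptions, not the authors' implementation: it treats \(\sigma\) as the sigmoid function, assumes the initial mean \(d'\) is an equal-weight average over \(N\) depth bins spaced evenly in log-depth (one reading that reproduces the stated limit), and reads \([dx, dy, 1]\) as the homogeneous coordinates of a 2D point. All function names are hypothetical.

```python
import math

D_MIN, D_MAX = 0.1, 100.0  # depth bounds quoted in the formula summary


def sigmoid(x: float) -> float:
    """Numerically stable logistic function (assumed meaning of sigma)."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def decode_depth(x: float, d_min: float = D_MIN, d_max: float = D_MAX) -> float:
    """Depth decoding: interpolate in inverse depth, then invert."""
    inv_depth = 1.0 / d_max + sigmoid(x) * (1.0 / d_min - 1.0 / d_max)
    return 1.0 / inv_depth


def limit_mean_depth(d_min: float = D_MIN, d_max: float = D_MAX) -> float:
    """Closed-form limit from the initial depth mean formula."""
    return (d_max - d_min) / (math.log(d_max) - math.log(d_min))


def mean_of_log_spaced_bins(n: int, d_min: float = D_MIN, d_max: float = D_MAX) -> float:
    """ASSUMPTION: equal-weight mean of n bin centres spaced evenly in log-depth."""
    step = (math.log(d_max) - math.log(d_min)) / n
    centres = (math.exp(math.log(d_min) + (i + 0.5) * step) for i in range(n))
    return sum(centres) / n


def point_line_distance(line: tuple, x: float, y: float) -> float:
    """Signed distance from the point (x, y) to the line L0*x + L1*y + L2 = 0."""
    l0, l1, l2 = line
    return (l0 * x + l1 * y + l2) / math.hypot(l0, l1)


if __name__ == "__main__":
    print("sigmoid decoding at x = 0:", decode_depth(0.0))
    print("mean of 10000 log-spaced bins:", mean_of_log_spaced_bins(10000))
    print("closed-form limit:", limit_mean_depth())
    print("distance of (3, 4) to the line x = 0:", point_line_distance((1.0, 0.0, 0.0), 3.0, 4.0))
```

With the quoted bounds, a zero network output decodes to roughly 0.2 through the sigmoid formula, while the limit formula evaluates to roughly 14.46. Under the log-spaced-bin assumption, the finite average approaches the closed-form limit as the number of bins grows.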

FSNet: Redesign Self-Supervised MonoDepth for Full-Scale Depth Prediction for Autonomous Driving

A Depth Estimation Framework Based on Unsupervised Learning and Cross-Modal Translation

A Robust Monocular Depth Estimation Framework Based on Light-Weight ERF-Pspnet for Day-Night Driving Scenes

Region Deformer Networks for Unsupervised Depth Estimation from Unconstrained Monocular Videos

Depth Estimation of Traffic Scenes from Image Sequence Using Deep Learning

Self-Supervised Monocular Depth Estimation Based on High-Order Spatial Interactions

FA-Depth: Toward Fast and Accurate Self-supervised Monocular Depth Estimation

Self-supervised Depth Estimation Leveraging Global Perception and Geometric Smoothness Using On-board Videos

Self-Supervised Depth Completion From Direct Visual-LiDAR Odometry in Autonomous Driving

Resolution-sensitive self-supervised monocular absolute depth estimation

SAU-Net: Monocular Depth Estimation Combining Multi-Scale Features and Attention Mechanisms

Self-Supervised Monocular Depth Estimation with Binary Mask and Lightweight Network

Self-supervised multi-frame depth estimation with visual-inertial pose transformer and monocular guidance

MDS-Net: Multi-Scale Depth Stratification 3D Object Detection from Monocular Images

Novel Hybrid Neural Network for Dense Depth Estimation Using On-Board Monocular Images

Unsupervised Scale-Consistent Depth Learning from Video

MDSNet: self-supervised monocular depth estimation for video sequences using self-attention and threshold mask

Unsupervised Monocular Estimation of Depth and Visual Odometry Using Attention and Depth-Pose Consistency Loss

FIS-Nets: Full-image Supervised Networks for Monocular Depth Estimation

KDepthNet: Mono-Camera Based Depth Estimation for Autonomous Driving

Self-supervised Sparse-to-Dense: Self-supervised Depth Completion from LiDAR and Monocular Camera