Abstract: Depth estimation is a critical topic for robotics and vision-related tasks. In monocular depth estimation, in comparison with supervised learning that requires expensive ground truth labeling, self-supervised methods possess great potential due to no labeling cost. However, self-supervised learning still has a large gap with supervised learning in 3D reconstruction and depth estimation performance. Meanwhile, scaling is also a major issue for monocular unsupervised depth estimation, which commonly still needs ground truth scale from GPS, LiDAR, or existing maps to correct. In the era of deep learning, existing methods primarily rely on exploring image relationships to train unsupervised neural networks, while the physical properties of the camera itself such as intrinsics and extrinsics are often overlooked. These physical properties are not just mathematical parameters; they are embodiments of the camera's interaction with the physical world. By embedding these physical properties into the deep learning model, we can calculate depth priors for ground regions and regions connected to the ground based on physical principles, providing free supervision signals without the need for additional sensors. This approach is not only easy to implement but also enhances the effects of all unsupervised methods by embedding the camera's physical properties into the model, thereby achieving an embodied understanding of the real world.

What problem does this paper attempt to address?

This paper addresses two key challenges in monocular depth estimation:

1. **Performance gap between self-supervised and supervised learning**: Although self-supervised methods do not require expensive ground-truth annotations, they still lag far behind supervised methods in 3D reconstruction and depth estimation performance. Moreover, self-supervised methods usually need additional sensors or data (such as GPS, LiDAR, or existing maps) to provide real-scale information and correct the scale problem.
2. **Neglect of physical camera model parameters**: Existing self-supervised methods mainly rely on exploring the relationships between images to train neural networks, ignoring the intrinsic and extrinsic physical characteristics of the camera itself. These characteristics are not only mathematical parameters; they also reflect the interaction between the camera and the physical world.

To solve these problems, the authors propose a method that embeds the physical parameters of the camera model into the deep-learning model in order to calculate depth for most areas in the scene. Specifically, the method improves self-supervised monocular depth estimation in the following ways:

- **Introducing the concept of "physical depth"**: The physical model parameters of the camera (such as focal length and optical center) are combined with semantic segmentation results to calculate the absolute depth of flat ground regions, which is then used as a supervision signal.
- **Solving the scale ambiguity problem**: The absolute scale provided by the physical depth replaces the relative scale, resolving the scale ambiguity in self-supervised monocular depth estimation.
- **Designing a new training framework**: Physical-depth supervision is combined with self-supervised methods in a neural-network training framework that is optimized for the physical depth calculated from the camera model.

This method not only improves the accuracy of depth estimation but also enhances the understanding of the real world, providing strong support for detailed 3D structure modeling. Experimental results show that the performance of this method on datasets such as KITTI and Cityscapes is close to or even better than that of LiDAR-based depth-estimation methods, especially on flat surfaces.
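The "physical depth" idea summarized above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a pinhole camera with no roll and no lens distortion, a perfectly flat ground plane, a known camera height above that plane, and a ground mask already produced by some semantic segmentation model. The function name `ground_depth_prior` and all of its parameters are hypothetical. Under these assumptions, a ground pixel in image row `v` has depth `Z = cam_height / (((v - cy) / fy) * cos(pitch) + sin(pitch))`, which reduces to `Z = fy * cam_height / (v - cy)` when the optical axis is parallel to the ground.

```python
import numpy as np


def ground_depth_prior(ground_mask, fy, cy, cam_height, pitch=0.0):
    """Sketch of a flat-ground depth prior (hypothetical helper, not the paper's code).

    ground_mask : (H, W) bool array, True where a pixel is labeled as ground.
    fy, cy      : vertical focal length and principal-point row, in pixels.
    cam_height  : camera height above the ground plane, in meters.
    pitch       : downward tilt of the optical axis, in radians (0 = level).

    Returns an (H, W) float array of depth along the optical axis, in meters.
    Pixels that are not ground, or that lie at or above the horizon, are NaN.
    """
    h, w = ground_mask.shape
    v = np.arange(h, dtype=np.float64)[:, None]  # image row index, shape (H, 1)

    # A pixel in row v back-projects to the ray Z * (x, (v - cy) / fy, 1) in
    # camera coordinates (y down, z forward). The "down" direction seen from a
    # camera tilted downward by `pitch` is (0, cos(pitch), sin(pitch)), and the
    # ground plane is the set of points whose projection onto "down" equals
    # cam_height. Solving for Z gives cam_height / denom.
    denom = ((v - cy) / fy) * np.cos(pitch) + np.sin(pitch)

    row_depth = np.full((h, 1), np.nan)
    below_horizon = denom > 1e-6  # rays at or above the horizon never hit the ground
    row_depth[below_horizon] = cam_height / denom[below_horizon]

    # With no roll, depth depends only on the row, so broadcast across columns.
    depth = np.broadcast_to(row_depth, (h, w)).copy()
    depth[~ground_mask] = np.nan
    return depth


if __name__ == "__main__":
    # Illustrative values only; real intrinsics and camera height come from calibration.
    mask = np.zeros((8, 6), dtype=bool)
    mask[5:, :] = True  # pretend the bottom three rows are ground
    prior = ground_depth_prior(mask, fy=10.0, cy=3.5, cam_height=1.6)
    print(prior[5:, 0])  # depth shrinks toward the bottom of the image
```

In a training loop, a prior like this could serve as a sparse, metrically scaled supervision target on ground pixels only, with the NaN entries excluded from the loss. How the paper weights that term and extends it to regions connected to the ground is not specified in this summary.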

Embodiment: Self-Supervised Depth Estimation Based on Camera Models

Monocular Depth Estimation Based on Unsupervised Learning

A Depth Estimation Framework Based on Unsupervised Learning and Cross-Modal Translation

Unsupervised Learning-based Depth Estimation aided Visual SLAM Approach

3D Object Aided Self-Supervised Monocular Depth Estimation

Depth360: Self-supervised Learning for Monocular Depth Estimation using Learnable Camera Distortion Model

Digging Into Self-Supervised Monocular Depth Estimation

Self-Supervised Learning based Depth Estimation from Monocular Images

A Lightweight Self-Supervised Training Framework for Monocular Depth Estimation

Self-Supervised 3D Reconstruction and Ego-Motion Estimation Via On-Board Monocular Video

Structure-Centric Robust Monocular Depth Estimation via Knowledge Distillation

Self-Supervised Monocular Depth Estimation with Self-Reference Distillation and Disparity Offset Refinement

Unsupervised Learning of Monocular Depth and Ego-motion in Outdoor/Indoor Environments

Self-Supervised Learning for Monocular Depth Estimation from Aerial Imagery

Self-Supervised Learning of Depth and Ego-motion for 3D Perception in Human Computer Interaction

Unsupervised Monocular Estimation of Depth and Visual Odometry Using Attention and Depth-Pose Consistency Loss

Towards Scale-Aware, Robust, and Generalizable Unsupervised Monocular Depth Estimation by Integrating IMU Motion Dynamics

SENSE: Self-Evolving Learning for Self-Supervised Monocular Depth Estimation

WS-SfMLearner: Self-supervised Monocular Depth and Ego-motion Estimation on Surgical Videos with Unknown Camera Parameters

Depth Estimation Based on Monocular Camera Sensors in Autonomous Vehicles: A Self-supervised Learning Approach