Embodiment: Self-Supervised Depth Estimation Based on Camera Models

Jinchang Zhang, Praveen Kumar Reddy, Xue-Iuan Wong, Yiannis Aloimonos, Guoyu Lu
2024-08-29
Abstract: Depth estimation is a critical topic for robotics and vision-related tasks. In monocular depth estimation, in comparison with supervised learning that requires expensive ground truth labeling, self-supervised methods possess great potential due to no labeling cost. However, self-supervised learning still has a large gap with supervised learning in 3D reconstruction and depth estimation performance. Meanwhile, scaling is also a major issue for monocular unsupervised depth estimation, which commonly still needs ground truth scale from GPS, LiDAR, or existing maps to correct. In the era of deep learning, existing methods primarily rely on exploring image relationships to train unsupervised neural networks, while the physical properties of the camera itself such as intrinsics and extrinsics are often overlooked. These physical properties are not just mathematical parameters; they are embodiments of the camera's interaction with the physical world. By embedding these physical properties into the deep learning model, we can calculate depth priors for ground regions and regions connected to the ground based on physical principles, providing free supervision signals without the need for additional sensors. This approach is not only easy to implement but also enhances the effects of all unsupervised methods by embedding the camera's physical properties into the model, thereby achieving an embodied understanding of the real world.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses two key challenges in monocular depth estimation:

1. **Performance gap between self-supervised and supervised learning**: Although self-supervised methods do not require expensive ground-truth annotations, they still lag well behind supervised methods in 3D reconstruction and depth estimation performance. In addition, their predictions are scale-ambiguous, so they usually need an external source of real-scale information (such as GPS, LiDAR, or existing maps) to correct the scale.
2. **Neglect of physical camera model parameters**: Existing self-supervised methods mainly rely on exploring the relationships between images to train neural networks and ignore the intrinsic and extrinsic parameters of the camera itself. These parameters are not only mathematical quantities; they also reflect the interaction between the camera and the physical world.

To address these problems, the authors propose embedding the physical parameters of the camera model into the deep-learning model in order to calculate depth priors for the ground and for regions connected to it. Specifically, the method improves self-supervised monocular depth estimation in the following ways:

- **Introducing the concept of "physical depth"**: The physical model parameters of the camera (such as focal length and optical center) are combined with semantic segmentation results to calculate the absolute depth of the flat ground area, which is then used as a supervision signal.
- **Resolving scale ambiguity**: The absolute scale provided by the physical depth replaces the relative scale that self-supervised monocular depth estimation otherwise produces.
- **Designing a new training framework**: Physical-depth supervision is combined with self-supervised methods in a neural-network training framework tailored to the physical depth calculated from the camera model.

This method not only improves the accuracy of depth estimation but also enhances the understanding of the real world, providing strong support for detailed 3D structure modeling. Experimental results show that the performance of this method on datasets such as KITTI and Cityscapes is close to or even better than that of LiDAR-based depth-estimation methods, especially on flat surfaces.
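The "physical depth" idea summarized above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes an ideal pinhole camera with a horizontal optical axis (zero pitch and roll) mounted at a known height above a planar ground, so a ground point at depth Z projects to image row v = c_y + f_y · h / Z. The function names (`ground_depth_at_row`, `physical_depth_map`, `masked_l1`) and the masked L1 penalty are hypothetical choices made for illustration; the paper's actual formulation and loss may differ, for example by accounting for camera pitch.

```python
from typing import List, Optional

DepthMap = List[List[Optional[float]]]


def ground_depth_at_row(v: float, fy: float, cy: float, cam_height: float) -> Optional[float]:
    """Metric depth of a flat-ground point imaged at pixel row v.

    Assumption: pinhole camera, zero pitch/roll, planar ground, image rows
    increasing downward. Inverting v = cy + fy * cam_height / Z gives
    Z = fy * cam_height / (v - cy).
    """
    if v <= cy:
        return None  # at or above the horizon row: the ray never meets the ground
    return fy * cam_height / (v - cy)


def physical_depth_map(ground_mask: List[List[bool]], fy: float, cy: float,
                       cam_height: float) -> DepthMap:
    """Fill in depth only where the (hypothetical) segmentation mask says 'ground'."""
    depth: DepthMap = []
    for v, row in enumerate(ground_mask):
        z = ground_depth_at_row(float(v), fy, cy, cam_height)
        depth.append([z if is_ground else None for is_ground in row])
    return depth


def masked_l1(pred: List[List[float]], prior: DepthMap) -> float:
    """Mean absolute error between predicted depth and the physical prior,
    evaluated only at pixels where a prior exists."""
    errors = [
        abs(p - z)
        for pred_row, prior_row in zip(pred, prior)
        for p, z in zip(pred_row, prior_row)
        if z is not None
    ]
    return sum(errors) / len(errors) if errors else 0.0


if __name__ == "__main__":
    # Illustrative values only (assumed, not taken from the paper):
    # focal length 720 px, principal point row 180, camera 1.65 m above the road.
    print(ground_depth_at_row(280.0, fy=720.0, cy=180.0, cam_height=1.65))
```

Because the prior is expressed in metres, a term like `masked_l1` pulls the network's predictions toward absolute scale on ground pixels, while the usual self-supervised photometric objective still covers the rest of the image. Rows at or above the principal-point row yield no prior under the zero-pitch assumption, which is why the sketch returns `None` there.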