Distortion-Aware Self-Supervised Indoor 360$^{\circ }$ Depth Estimation Via Hybrid Projection Fusion and Structural Regularities

Xu Wang, Weifeng Kong, Qiudan Zhang, You Yang, Tiesong Zhao, Jianmin Jiang
DOI: https://doi.org/10.1109/tmm.2023.3318470
IF: 7.3
2024-01-01
IEEE Transactions on Multimedia
Abstract: Owing to the rapid development of $360^{\circ }$ panoramic imaging techniques, indoor $360^{\circ }$ depth estimation has attracted extensive attention in the community. Because ground-truth depth data are scarce, there is a pressing need to perform indoor $360^{\circ }$ depth estimation in a self-supervised manner. However, self-supervised $360^{\circ }$ depth estimation suffers from two major limitations. One is the distortion and network training problems caused by equirectangular projection (ERP); the other is that texture-less regions are difficult to supervise through back-propagation in self-supervised mode. To address these issues, we introduce spherical view synthesis for learning self-supervised $360^{\circ }$ depth estimation. Specifically, to alleviate the ERP-related problems, we first propose a dual-branch distortion-aware network, comprising a distortion-aware module and a hybrid projection fusion module, to produce a coarse depth map. Subsequently, the coarse depth map is used for spherical view synthesis, in which a spherically weighted loss function for view reconstruction and depth smoothing is investigated to address the projection distribution problem of $360^{\circ }$ images. In addition, two structural regularities of indoor $360^{\circ }$ scenes, namely the principal-direction normal constraint and the co-planar depth constraint, are devised as additional supervisory signals to efficiently optimize our self-supervised $360^{\circ }$ depth estimation model. The principal-direction normal constraint is designed to align the normals of the $360^{\circ }$ image with the directions of the vanishing points. Meanwhile, we employ the co-planar depth constraint to fit the estimated depth of each pixel to its 3D plane. Finally, a depth map is obtained for the $360^{\circ }$ image. Experimental results show that the proposed method outperforms current advanced depth estimation methods on four publicly available datasets.
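The abstract does not give the form of the spherically weighted loss. As a rough illustration of the general idea only, the sketch below weights a per-pixel L1 error by the cosine of each ERP row's latitude, so that the over-sampled polar rows contribute less than equatorial rows. The cosine weighting, the L1 error, and the function names `erp_latitude_weights` and `spherically_weighted_l1` are assumptions made for this example, not the authors' formulation.

```python
import numpy as np

def erp_latitude_weights(height, width):
    """Per-pixel weights proportional to cos(latitude), normalised to sum to 1.

    Assumption: rows of the ERP image span latitudes from +pi/2 (top)
    to -pi/2 (bottom), sampled at pixel centres.
    """
    lat = (0.5 - (np.arange(height) + 0.5) / height) * np.pi
    w = np.cos(lat)[:, None] * np.ones((1, width))
    return w / w.sum()

def spherically_weighted_l1(pred, target):
    """Latitude-weighted mean absolute error between two ERP images.

    Accepts H x W or H x W x C arrays; channels are averaged first.
    """
    h, w = pred.shape[:2]
    weights = erp_latitude_weights(h, w)
    err = np.abs(pred - target)
    if err.ndim == 3:
        err = err.mean(axis=2)
    return float((weights * err).sum())
```

With this weighting, a uniform error over the whole image yields the same value as an unweighted mean, while an error confined to the top or bottom rows is down-weighted relative to the same error near the equator.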