Complete contextual information extraction for self-supervised monocular depth estimation
Dazheng Zhou, Mingliang Zhang, Xianjie Gao, Youmei Zhang, Bin Li
DOI: https://doi.org/10.1016/j.cviu.2024.104032
IF: 4.886
2024-05-17
Computer Vision and Image Understanding
Abstract: Self-supervised learning methods are increasingly important for monocular depth estimation because they do not require ground-truth depth data during training. Although existing methods based on Convolutional Neural Networks (CNNs) have achieved strong results, the limited receptive field of CNNs is usually insufficient to model global information, e.g., the relationship between foreground and background or the relationships among objects, which is crucial for accurately capturing scene structure. Recently, Transformer-based approaches have attracted significant interest in computer vision. However, due to their lack of spatial locality bias, they may fail to model local information, e.g., fine-grained details within an image. To tackle these issues, we propose a novel self-supervised learning framework that combines the advantages of CNNs and Transformers to model the complete contextual information for high-quality monocular depth estimation. Specifically, the proposed method consists mainly of two branches: the Transformer branch captures global information, while the convolution branch preserves local information. We also design a rectangle convolution module with a pyramid structure to perceive semi-global information, e.g., thin objects, along the horizontal and vertical directions within an image. Moreover, we propose a shape refinement module that learns the affinity matrix between each pixel and its neighborhood to obtain an accurate geometric structure of scenes. Extensive experiments on the KITTI, Cityscapes, and Make3D datasets demonstrate that the proposed method achieves competitive results compared with state-of-the-art self-supervised monocular depth estimation methods and shows good cross-dataset generalization ability.
computer science, artificial intelligence; engineering, electrical & electronic
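The abstract does not say how the shape refinement module applies the affinity matrix between a pixel and its neighborhood. As a rough illustration only, the sketch below assumes that each pixel's depth is replaced by a softmax-weighted average of its 3x3 neighborhood, with the weights derived from per-pixel affinity scores. The function name `refine_depth`, the 3x3 window, the softmax normalization, and the `affinity_logits` input (a stand-in for what a network might predict) are all assumptions, not details taken from the paper.

```python
import math

# 3x3 neighborhood offsets (dy, dx), including the center pixel at index 4.
OFFSETS = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]


def refine_depth(depth, affinity_logits):
    """Replace each depth value with an affinity-weighted average of its neighbors.

    depth: H x W nested list of floats.
    affinity_logits: H x W x 9 nested list of raw scores, one per offset
        (hypothetical stand-in for a learned affinity prediction).
    """
    h, w = len(depth), len(depth[0])
    refined = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            logits, values = [], []
            for k, (dy, dx) in enumerate(OFFSETS):
                ny, nx = y + dy, x + dx
                # Neighbors outside the image are skipped; the softmax below
                # renormalizes over the neighbors that remain.
                if 0 <= ny < h and 0 <= nx < w:
                    logits.append(affinity_logits[y][x][k])
                    values.append(depth[ny][nx])
            # Numerically stable softmax over the valid neighbors.
            peak = max(logits)
            weights = [math.exp(v - peak) for v in logits]
            total = sum(weights)
            refined[y][x] = sum(wt * v for wt, v in zip(weights, values)) / total
    return refined


# Usage: uniform affinities reduce to a plain neighborhood mean.
depth = [[1.0, 2.0], [3.0, 4.0]]
uniform = [[[0.0] * 9 for _ in range(2)] for _ in range(2)]
smoothed = refine_depth(depth, uniform)
```

With uniform scores the operation is plain local averaging. A learned affinity would instead down-weight neighbors that lie across a depth discontinuity, which is one plausible way such a module could keep object boundaries sharp.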