Multiple Prior Representation Learning for Self-Supervised Monocular Depth Estimation via Hybrid Transformer

Guodong Sun, Junjie Liu, Mingxuan Liu, Moyun Liu, Yang Zhang
2024-06-13
Abstract: Self-supervised monocular depth estimation aims to infer depth information without relying on labeled data. However, the lack of labeled information poses a significant challenge to the model's representation, limiting its ability to capture the intricate details of the scene accurately. Prior information can potentially mitigate this issue, enhancing the model's understanding of scene structure and texture. Nevertheless, solely relying on a single type of prior information often falls short when dealing with complex scenes, necessitating improvements in generalization performance. To address these challenges, we introduce a novel self-supervised monocular depth estimation model that leverages multiple priors to bolster representation capabilities across spatial, context, and semantic dimensions. Specifically, we employ a hybrid transformer and a lightweight pose network to obtain long-range spatial priors in the spatial dimension. Then, the context prior attention is designed to improve generalization, particularly in complex structures or untextured areas. In addition, semantic priors are introduced by leveraging semantic boundary loss, and semantic prior attention is supplemented, further refining the semantic features extracted by the decoder. Experiments on three diverse datasets demonstrate the effectiveness of the proposed model. It integrates multiple priors to comprehensively enhance the representation ability, improving the accuracy and reliability of depth estimation. Codes are available at: https://github.com/MVME-HBUT/MPRLNet
Computer Vision and Pattern Recognition, Image and Video Processing
What problem does this paper attempt to address?
The main problem this paper addresses is the limited representational capacity and generalization performance of self-supervised monocular depth estimation. Because no annotated depth data is available, existing self-supervised methods struggle to capture fine detail in complex scenes and textureless regions, which in turn limits how well they generalize. The authors propose a new self-supervised monocular depth estimation model that improves both properties by combining three types of prior information: spatial priors, contextual priors, and semantic priors.

### Specific Problems and Solutions:

1. **Insufficient Representational Capacity**: Without annotated data, existing methods struggle to capture complex scene details. The paper uses a Hybrid Transformer to obtain long-range spatial priors and strengthens the model's representations through the Context Prior Attention (CPA) and Semantic Prior Attention (SPA) mechanisms.
2. **Poor Generalization Performance**: Existing methods generalize poorly on complex structures and textureless regions. The paper designs a Context Prior Attention mechanism that attends to the pixels surrounding these regions in order to improve generalization.
3. **Boundary Scale Bias**: Depth models often exhibit boundary scale bias in complex scenes and weakly textured regions. The paper addresses this with a Semantic Boundary Loss (SBL) and the Semantic Prior Attention mechanism, which help the model localize object boundaries more accurately.

### Main Contributions:

- **Hybrid Transformer and Lightweight Pose Network**: Model long-range dependencies in the spatial dimension, improving the model's ability to capture global spatial relationships.
- **Context Prior Attention Mechanism**: Designed to perceive the pixels around complex structures and weakly textured regions, improving generalization.
- **Semantic Priors**: Address boundary scale bias through the Semantic Boundary Loss and the Semantic Prior Attention mechanism, refining the semantic features extracted by the decoder.
- **Complementarity of Multi-Dimensional Prior Information**: Explores how spatial, contextual, and semantic priors complement one another to strengthen representation learning for self-supervised monocular depth estimation, improving both accuracy and generalization.

The proposed model is evaluated on three datasets. On the KITTI dataset it achieves an Absolute Relative Error (Abs Rel) of 0.104 and a Squared Relative Error (Sq Rel) of 0.705, which the authors report as outperforming existing methods.
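
For readers unfamiliar with the two metrics quoted above, the following is a minimal sketch of how Abs Rel and Sq Rel are commonly computed in the depth estimation literature. The summary does not spell out the formulas, so the definitions here (mean of |gt - pred| / gt and mean of (gt - pred)^2 / gt over pixels with valid ground truth) are an assumption based on common practice. The function name `depth_errors` and the sample values are hypothetical and are not taken from the paper or its code.

```python
from typing import Sequence, Tuple


def depth_errors(gt: Sequence[float], pred: Sequence[float]) -> Tuple[float, float]:
    """Return (abs_rel, sq_rel) over pixels whose ground-truth depth is positive.

    Assumed definitions (common convention, not confirmed by the source):
      abs_rel = mean(|gt - pred| / gt)
      sq_rel  = mean((gt - pred)^2 / gt)
    """
    # Keep only pixels with valid (positive) ground truth; zeros usually
    # mark missing measurements in sparse depth maps.
    pairs = [(g, p) for g, p in zip(gt, pred) if g > 0]
    if not pairs:
        raise ValueError("no valid ground-truth depths")
    n = len(pairs)
    abs_rel = sum(abs(g - p) / g for g, p in pairs) / n
    sq_rel = sum((g - p) ** 2 / g for g, p in pairs) / n
    return abs_rel, sq_rel


# Hypothetical toy data in metres; the third pixel has no ground truth.
gt_depth = [10.0, 20.0, 0.0, 40.0]
pred_depth = [11.0, 18.0, 5.0, 40.0]
abs_rel, sq_rel = depth_errors(gt_depth, pred_depth)
print(f"Abs Rel = {abs_rel:.3f}, Sq Rel = {sq_rel:.3f}")
```

Both metrics are errors, so lower is better. Published evaluation protocols for self-supervised methods often add steps not shown here, such as cropping, capping the depth range, and rescaling predictions before comparison, so this sketch should not be expected to reproduce the reported numbers.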