Abstract: Self-supervised monocular depth estimation aims to infer depth information without relying on labeled data. However, the lack of labeled information poses a significant challenge to the model's representation, limiting its ability to capture the intricate details of the scene accurately. Prior information can potentially mitigate this issue, enhancing the model's understanding of scene structure and texture. Nevertheless, solely relying on a single type of prior information often falls short when dealing with complex scenes, necessitating improvements in generalization performance. To address these challenges, we introduce a novel self-supervised monocular depth estimation model that leverages multiple priors to bolster representation capabilities across spatial, context, and semantic dimensions. Specifically, we employ a hybrid transformer and a lightweight pose network to obtain long-range spatial priors in the spatial dimension. Then, the context prior attention is designed to improve generalization, particularly in complex structures or untextured areas. In addition, semantic priors are introduced by leveraging semantic boundary loss, and semantic prior attention is supplemented, further refining the semantic features extracted by the decoder. Experiments on three diverse datasets demonstrate the effectiveness of the proposed model. It integrates multiple priors to comprehensively enhance the representation ability, improving the accuracy and reliability of depth estimation. Codes are available at: https://github.com/MVME-HBUT/MPRLNet

What problem does this paper attempt to address?

The main problem this paper attempts to address is the limited representational capacity and generalization performance of self-supervised monocular depth estimation. Specifically, because annotated data is unavailable, existing self-supervised methods struggle to accurately capture detailed information in complex scenes and textureless regions, which limits the model's generalization ability. To solve these issues, the authors propose a new self-supervised monocular depth estimation model that enhances representational capacity and generalization performance by leveraging multiple types of prior information (spatial priors, contextual priors, and semantic priors).

### Specific Problems and Solutions:

1. **Insufficient Representational Capacity**: Existing methods struggle to capture complex details in scenes due to the lack of annotated data. To address this, the paper introduces multiple types of prior information, using a Hybrid Transformer to obtain long-range spatial priors and strengthening the model's representational capacity through the Context Prior Attention (CPA) and Semantic Prior Attention (SPA) mechanisms.
2. **Poor Generalization Performance**: Existing methods generalize poorly when dealing with complex structures or textureless regions. The paper designs a Context Prior Attention mechanism that focuses on the pixels surrounding these regions in order to improve the model's generalization ability.
3. **Boundary Scale Bias**: In depth estimation, models often exhibit boundary scale bias in complex scenes and weakly textured regions. The paper addresses this issue by introducing a Semantic Boundary Loss (SBL) and the Semantic Prior Attention mechanism, enabling the model to capture object boundaries more accurately.
### Main Contributions:

- **Hybrid Transformer and Lightweight Pose Network**: Used to model long-range dependencies in the spatial dimension, enhancing the model's ability to capture global spatial relationships.
- **Context Prior Attention Mechanism**: Designed to perceive the pixels around complex structures or regions with limited texture, improving generalization ability.
- **Semantic Priors**: Address boundary scale bias through the Semantic Boundary Loss and Semantic Prior Attention mechanisms, refining semantic features.
- **Complementarity of Multi-Dimensional Prior Information**: Explores the complementarity of spatial, contextual, and semantic prior information to enhance the representation learning ability of self-supervised monocular depth estimation, improving model performance and generalization ability.

With these contributions, the proposed model is evaluated on multiple datasets. On the KITTI dataset it achieves an Absolute Relative Error (Abs Rel) of 0.104 and a Squared Relative Error (Sq Rel) of 0.705 (lower is better for both), which the paper reports as outperforming existing methods.
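To make the two reported metrics concrete, here is a minimal sketch of how Abs Rel and Sq Rel are commonly defined in the depth-estimation literature: the mean of |pred - gt| / gt and the mean of (pred - gt)^2 / gt over pixels with valid ground truth. The function name `depth_errors` and the `gt > 0` validity rule are assumptions made for illustration, not taken from the paper's code. Evaluation protocols typically add further steps, such as depth capping or scale alignment, which are omitted here.

```python
def depth_errors(pred, gt):
    """Return (abs_rel, sq_rel) for flat sequences of predicted and true depths.

    Assumption for this sketch: a pixel is valid only if its ground-truth
    depth is strictly positive; invalid pixels are skipped.
    """
    pairs = [(p, g) for p, g in zip(pred, gt) if g > 0]
    if not pairs:
        raise ValueError("no valid ground-truth depths")
    n = len(pairs)
    # Absolute relative error: mean of |pred - gt| / gt
    abs_rel = sum(abs(p - g) / g for p, g in pairs) / n
    # Squared relative error: mean of (pred - gt)^2 / gt
    sq_rel = sum((p - g) ** 2 / g for p, g in pairs) / n
    return abs_rel, sq_rel


if __name__ == "__main__":
    # Toy depths in metres; the third pixel has no ground truth and is ignored.
    predicted = [9.0, 22.0, 5.0]
    ground_truth = [10.0, 20.0, 0.0]
    print(depth_errors(predicted, ground_truth))
```

Both errors are normalized by the true depth, so a fixed absolute error counts for more on nearby points than on distant ones. Sq Rel additionally penalizes large outliers more heavily than Abs Rel.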

Multiple Prior Representation Learning for Self-Supervised Monocular Depth Estimation via Hybrid Transformer

A Depth Estimation Framework Based on Unsupervised Learning and Cross-Modal Translation

Monocular Depth Estimation Based on Unsupervised Learning

Self-supervised multi-frame depth estimation with visual-inertial pose transformer and monocular guidance

Towards Comprehensive Monocular Depth Estimation: Multiple Heads are Better Than One

DepthFormer: Exploiting Long-range Correlation and Local Information for Accurate Monocular Depth Estimation

Lightweight Self-Supervised Monocular Depth Estimation Through CNN and Transformer Integration

Complete contextual information extraction for self-supervised monocular depth estimation

High-Precision Self-Supervised Monocular Depth Estimation with Rich-Resource Prior

Deep Digging into the Generalization of Self-Supervised Monocular Depth Estimation

HA-Bins: Hierarchical Adaptive Bins for Robust Monocular Depth Estimation across Multiple Datasets

Depthformer : Multiscale Vision Transformer For Monocular Depth Estimation With Local Global Information Fusion

TAMDepth: self-supervised monocular depth estimation with transformer and adapter modulation

Monocular Depth Estimation via Self-Supervised Self-Distillation

Self-supervised Monocular Depth Estimation with Large Kernel Attention

SC-DepthV3: Robust Self-supervised Monocular Depth Estimation for Dynamic Scenes

Semantically-Guided Representation Learning for Self-Supervised Monocular Depth

Bridging local and global representations for self-supervised monocular depth estimation

Self-Supervised Monocular Depth Estimation with Self-Reference Distillation and Disparity Offset Refinement

Transformer-Based Self-Supervised Monocular Depth and Visual Odometry

MonoViT: Self-Supervised Monocular Depth Estimation with a Vision Transformer