Abstract: Monocular depth estimation is very challenging because clues to the exact depth are incomplete in a single RGB image. To overcome the limitation, deep neural networks rely on various visual hints such as size, shade, and texture extracted from RGB information. However, we observe that if such hints are overly exploited, the network can be biased toward RGB information without considering the comprehensive view. We propose a novel depth estimation model named RElative Depth Transformer (RED-T) that uses relative depth as guidance in self-attention. Specifically, the model assigns high attention weights to pixels of close depth and low attention weights to pixels of distant depth. As a result, the features of similar depth can become more similar to each other and thus less prone to misused visual hints. We show that the proposed model achieves competitive results in monocular depth estimation benchmarks and is less biased toward RGB information. In addition, we propose a novel monocular depth estimation benchmark that limits the observable depth range during training in order to evaluate the robustness of the model for unseen depths.
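The range-limited benchmark mentioned at the end of the abstract can be illustrated with a small sketch. It assumes that "limiting the observable depth range" means that only pixels whose ground-truth depth lies in a chosen interval contribute to the training loss. The function name `range_limited_l1`, the bounds `d_min` and `d_max`, and the plain L1 loss are all hypothetical choices made for illustration; the abstract does not specify the loss or the range.

```python
import numpy as np

def range_limited_l1(pred, target, d_min, d_max):
    """Mean absolute error over pixels whose label lies in [d_min, d_max].

    Assumption: labels outside the range are simply hidden from training.
    """
    # Boolean mask of pixels whose ground-truth depth is observable.
    mask = (target >= d_min) & (target <= d_max)
    if not mask.any():
        # No observable pixel in this sample: contribute no loss.
        return 0.0
    return float(np.abs(pred[mask] - target[mask]).mean())

# Toy example: the third pixel (depth 5.0) is outside the range and is ignored.
pred = np.array([1.0, 2.0, 9.0])
target = np.array([1.0, 3.0, 5.0])
loss = range_limited_l1(pred, target, d_min=0.0, d_max=4.0)
```

At evaluation time the mask would be dropped, so that errors on depths never seen during training are measured as well.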

What problem does this paper attempt to address?

### Problems Addressed by the Paper

This paper addresses the problem of misused visual hints in Monocular Depth Estimation (MDE). MDE is challenging because a single RGB image contains only incomplete clues to the exact depth. Deep neural networks compensate by exploiting visual hints such as size, shade, and texture, but over-reliance on these hints can bias the network toward RGB information and degrade the accuracy of depth estimation.

### Solution

The authors propose a depth estimation model called the RElative Depth Transformer (RED-T). Its core idea is to use relative depth as guidance in the self-attention mechanism, in three steps:

1. **Relative Depth Calculation**: Calculate the relative depth between each pair of pixels.
2. **Discretization and Embedding**: Discretize the relative depth values and map them to trainable embedding parameters.
3. **Depth-Relative Self-Attention**: Incorporate the relative depth embeddings into the self-attention weights, so that pixels of similar depth receive higher attention weights.

As a result, features of pixels at similar depth become more similar to each other, which reduces the model's reliance on RGB information and mitigates the impact of misused visual hints on depth estimation.

### Experimental Validation

To evaluate RED-T, the authors conducted the following experiments:

1. **Benchmark Testing**: On the NYU-v2 and KITTI datasets, the reported results show RED-T to be competitive with existing MDE models.
2. **Range-Limited MDE Task**: A new MDE task is proposed in which only a limited range of depth labels is available during training. The reported results indicate that RED-T is more robust on depth ranges not seen during training.

### Main Contributions

1. **Introduction of Relative Depth**: Relative depth is used as guidance to address misused visual hints in MDE.
2. **Depth-Relative Attention Mechanism**: A depth-relative attention mechanism is designed so that the model takes depth information into account during feature extraction.
3. **Performance Evaluation**: RED-T is evaluated on multiple datasets, and a new range-limited evaluation scenario is proposed to test robustness to unseen depths.

### Conclusion

By guiding self-attention with relative depth, RED-T reduces the impact of misused visual hints on depth estimation and, according to the reported experiments, is more robust to depth ranges unseen during training.
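The three steps listed under Solution (relative depth calculation, discretization and embedding, depth-relative self-attention) can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes one depth value per token (for example from a coarse intermediate prediction), uses the absolute depth difference as the relative depth, bins it uniformly, and adds one scalar bias per bin to the attention logits. The names `relative_depth_bias`, `depth_relative_attention`, `max_rel`, and `embedding` are hypothetical, and `embedding` stands in for parameters that would be trained in a real model.

```python
import numpy as np

def relative_depth_bias(depth, embedding, max_rel):
    """Steps 1 and 2: pairwise relative depth -> bin index -> scalar bias.

    depth:     (N,) one depth value per token (assumed to be available)
    embedding: (B,) one bias per bin (trainable in a real model)
    max_rel:   relative depths above this value share the last bin
    """
    num_bins = embedding.shape[0]
    # Step 1: absolute depth difference between every pair of tokens, (N, N).
    rel = np.abs(depth[:, None] - depth[None, :])
    rel = np.clip(rel, 0.0, max_rel)
    # Step 2: uniform discretization into num_bins bins.
    bins = np.minimum((rel / max_rel * num_bins).astype(int), num_bins - 1)
    return embedding[bins]

def depth_relative_attention(q, k, v, depth, embedding, max_rel):
    """Step 3: scaled dot-product attention with an additive depth bias."""
    logits = q @ k.T / np.sqrt(q.shape[-1])
    logits = logits + relative_depth_bias(depth, embedding, max_rel)
    # Numerically stable softmax over the key dimension.
    logits = logits - logits.max(axis=-1, keepdims=True)
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

# Toy example: tokens 0 and 1 are close in depth, token 2 is far away.
depth = np.array([1.0, 1.1, 5.0])
q = np.zeros((3, 4))  # zero queries and keys isolate the effect of the bias
k = np.zeros((3, 4))
v = np.eye(3)
embedding = np.linspace(2.0, -2.0, 8)  # hand-set: larger bias for closer depth
out, weights = depth_relative_attention(q, k, v, depth, embedding, max_rel=4.0)
```

With these hand-set biases, token 0 attends more strongly to token 1 than to token 2. In a trained model the bias values would be learned rather than fixed, so the preference for similar depths would emerge from training instead of being imposed.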

Depth-Relative Self Attention for Monocular Depth Estimation
