Depth-Relative Self Attention for Monocular Depth Estimation

Kyuhong Shim, Jiyoung Kim, Gusang Lee, Byonghyo Shim
2023-04-25
Abstract: Monocular depth estimation is very challenging because clues to the exact depth are incomplete in a single RGB image. To overcome the limitation, deep neural networks rely on various visual hints such as size, shade, and texture extracted from RGB information. However, we observe that if such hints are overly exploited, the network can be biased toward RGB information without considering the comprehensive view. We propose a novel depth estimation model named RElative Depth Transformer (RED-T) that uses relative depth as guidance in self-attention. Specifically, the model assigns high attention weights to pixels of close depth and low attention weights to pixels of distant depth. As a result, the features of similar depth can become more similar to each other and thus less prone to misused visual hints. We show that the proposed model achieves competitive results in monocular depth estimation benchmarks and is less biased toward RGB information. In addition, we propose a novel monocular depth estimation benchmark that limits the observable depth range during training in order to evaluate the robustness of the model for unseen depths.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
### Problems Addressed by the Paper

This paper addresses the issue of visual pits, that is, misused visual hints, in Monocular Depth Estimation (MDE). MDE is highly challenging because a single RGB image provides only incomplete clues to the exact depth. Deep neural networks work around this limitation by exploiting visual cues such as size, shade, and texture, but over-reliance on these cues biases the network toward RGB information and degrades the accuracy of depth estimation.

### Solution

To solve this problem, the authors propose a new depth estimation model called the RElative Depth Transformer (RED-T). The core idea is to use relative depth as guidance in the self-attention mechanism. The specific steps are as follows:

1. **Relative Depth Calculation**: Calculate the relative depth between each pair of pixels.
2. **Discretization and Embedding**: Discretize the relative depth values and map them to trainable embedding parameters.
3. **Depth-Relative Self-Attention**: Incorporate the relative depth embedding parameters into the self-attention weights, so that pixels with similar depths receive higher attention weights.

In this way, the model reduces its reliance on RGB information, thereby mitigating the impact of visual pits on depth estimation.

### Experimental Validation

To verify the effectiveness of RED-T, the authors conducted the following experiments:

1. **Benchmark Testing**: Experiments were conducted on the NYU-v2 and KITTI datasets, where RED-T achieved results competitive with existing MDE models.
2. **Range-Limited MDE Task**: A new MDE task was proposed in which only a limited range of depth labels is provided during training. The results indicate that RED-T is more robust when dealing with unseen depth ranges.

### Main Contributions

1. **Introduction of Relative Depth**: Relative depth is introduced as guidance to address the issue of visual pits in MDE.
2. **Depth-Relative Attention Mechanism**: A new depth-relative attention mechanism is designed, allowing the model to take depth information into account during feature extraction.
3. **Performance Evaluation**: RED-T is evaluated on multiple datasets, and a new evaluation scenario is proposed to measure the model's robustness to depth ranges unseen during training.

### Conclusion

By utilizing relative depth information, RED-T reduces the impact of visual pits on depth estimation and is less biased toward RGB information, which improves the robustness of the model on monocular depth estimation tasks.
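As a rough illustration of the three steps listed under Solution (relative depth calculation, discretization and embedding, depth-relative self-attention), the following NumPy sketch adds a per-bin bias to ordinary single-head attention logits. It is not the authors' implementation: the function names, the uniform binning, the `max_diff` clipping range, and the use of one scalar bias per bin (rather than a full embedding vector) are all assumptions made for illustration, and the depth values are placeholders, since this summary does not say where the model obtains them.

```python
import numpy as np

def relative_depth_bins(depth, num_bins, max_diff):
    """Steps 1 and 2 (hypothetical helper): pairwise relative depth -> bin index."""
    # Step 1: signed relative depth between every pair of pixels, shape (N, N).
    rel = depth[:, None] - depth[None, :]
    rel = np.clip(rel, -max_diff, max_diff)
    # Step 2: uniform discretization of [-max_diff, max_diff] into num_bins bins
    # (an assumed binning scheme).
    scaled = (rel + max_diff) / (2.0 * max_diff)  # values in [0, 1]
    return np.minimum((scaled * num_bins).astype(int), num_bins - 1)

def depth_relative_attention(x, depth, w_q, w_k, w_v, bin_bias, max_diff):
    """Step 3 (sketch): single-head attention with a depth-dependent bias."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    logits = q @ k.T / np.sqrt(q.shape[-1])       # content-based term
    bins = relative_depth_bins(depth, len(bin_bias), max_diff)
    logits = logits + bin_bias[bins]              # look up one bias per pixel pair
    logits -= logits.max(axis=-1, keepdims=True)  # numerically stable softmax
    weights = np.exp(logits)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

# Toy example: two near pixels (about 1 m) and two far pixels (about 5 m).
depth = np.array([1.0, 1.1, 5.0, 5.2])
x = np.zeros((4, 8))   # zero features, so only the depth term shapes the attention
eye = np.eye(8)
# Hand-set bias favouring small relative depth; it would be trainable in a model.
bin_bias = np.array([-2.0, 0.0, 2.0, 0.0, -2.0])

out, weights = depth_relative_attention(x, depth, eye, eye, eye, bin_bias, max_diff=5.0)
print(weights.round(3))
```

With the bias set this way, each pixel places more attention on the pixel at a similar depth than on the pixels at a distant depth, which is the behaviour the abstract describes. In a trained model the bias values would be learned rather than fixed by hand.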