Abstract: Large-scale pre-trained vision models are becoming increasingly prevalent, offering expressive and generalizable visual representations that benefit various downstream tasks. Recent studies on the emergent properties of these models have revealed their high-level geometric understanding, in particular in the context of depth perception. However, it remains unclear how depth perception arises in these models without explicit depth supervision provided during pre-training. To investigate this, we examine whether the monocular depth cues, similar to those used by the human visual system, emerge in these models. We introduce a new benchmark, DepthCues, designed to evaluate depth cue understanding, and present findings across 20 diverse and representative pre-trained vision models. Our analysis shows that human-like depth cues emerge in more recent larger models. We also explore enhancing depth perception in large vision models by fine-tuning on DepthCues, and find that even without dense depth supervision, this improves depth estimation. To support further research, our benchmark and evaluation code will be made publicly available for studying depth perception in vision models.


### What problem does this paper attempt to solve?

This paper investigates how large-scale pre-trained vision models develop human-like monocular depth perception without explicit depth supervision. Specifically, the authors raise the following questions:

1. **How does depth perception emerge in these large-scale vision models?**
   - Although these models receive no explicit depth supervision during pre-training, they appear to understand and use monocular depth cues such as height, light and shadow, occlusion, perspective, size, and texture gradient.
2. **Do these models understand and use monocular depth cues? And if so, how?**
   - To answer this question, the authors introduce a new benchmark, DepthCues, which evaluates how well vision models understand monocular depth cues. Using this benchmark, they analyze 20 pre-trained vision models and study their performance on the different tasks.
3. **Can the depth perception ability of these models be enhanced through fine-tuning?**
   - The paper also explores enhancing the depth perception of vision models by fine-tuning on DepthCues. The results show that, even without dense depth supervision, this fine-tuning improves depth estimation.

### Main contributions of the paper

- **Developed the DepthCues benchmark**: A benchmark designed specifically to evaluate whether human-like monocular depth cues emerge in large-scale vision models. The abstract states that the benchmark and evaluation code will be made publicly available.
- **Evaluated 20 vision models with different pre-training settings**: Analyzed the relative strengths and weaknesses of these models in capturing monocular depth cues.
- **Revealed human-like monocular depth cues in self-supervised pre-trained models**: Found that newer and larger models show a stronger understanding of these cues.
- **Explored enhancing depth perception through fine-tuning**: Demonstrated that injecting human monocular depth cues can improve depth perception.

### Summary

By introducing the DepthCues benchmark, this paper systematically evaluates how large-scale vision models understand and use monocular depth cues without explicit depth supervision. The results offer insight into how depth perception arises in these models and suggest directions for future research and improvement.
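The summary above does not say how cue understanding is actually measured. One common way to test whether frozen pre-trained features encode a given property is a linear probe, and the sketch below illustrates that idea only. It is an assumption for illustration, not the DepthCues evaluation code: the features are synthetic stand-ins, and every name (`make_synthetic_features`, `train_linear_probe`, `probe_accuracy`, `evaluate`, the `signal` parameter) is hypothetical.

```python
import math
import random


def make_synthetic_features(n, dim, signal, rng):
    """Hypothetical stand-in for frozen model features on a binary cue task.

    `signal` controls how linearly decodable the label is from feature 0.
    """
    feats, labels = [], []
    for _ in range(n):
        y = rng.randint(0, 1)
        x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        x[0] += signal * (2 * y - 1)  # shift feature 0 by +/- signal
        feats.append(x)
        labels.append(y)
    return feats, labels


def sigmoid(z):
    z = max(-30.0, min(30.0, z))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))


def train_linear_probe(feats, labels, lr=0.1, epochs=100):
    """Fit logistic regression with full-batch gradient descent.

    Only the probe weights are trained; the features stay fixed.
    """
    dim, n = len(feats[0]), len(feats)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(feats, labels):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def probe_accuracy(feats, labels, w, b):
    """Fraction of examples whose thresholded probe output matches the label."""
    correct = 0
    for x, y in zip(feats, labels):
        pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
        correct += int(pred == y)
    return correct / len(feats)


def evaluate(signal, seed):
    """Train a probe on one synthetic split and score it on a held-out split."""
    rng = random.Random(seed)
    train_x, train_y = make_synthetic_features(300, 8, signal, rng)
    test_x, test_y = make_synthetic_features(300, 8, signal, rng)
    w, b = train_linear_probe(train_x, train_y)
    return probe_accuracy(test_x, test_y, w, b)


strong_acc = evaluate(signal=2.0, seed=0)  # features that encode the cue
weak_acc = evaluate(signal=0.0, seed=0)    # features that carry no cue signal
print(f"cue-encoding features: {strong_acc:.2f}, uninformative features: {weak_acc:.2f}")
```

In a real setting, the synthetic features would be replaced by embeddings extracted from a frozen vision model on images labeled for a single cue (for example, which of two objects occludes the other). Held-out probe accuracy well above chance would then suggest that the cue is linearly decodable from the representation.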

DepthCues: Evaluating Monocular Depth Perception in Large Vision Models
