ECoDepth: Effective Conditioning of Diffusion Models for Monocular Depth Estimation

Suraj Patni, Aradhye Agarwal, Chetan Arora
2024-04-17
Abstract: In the absence of parallax cues, a learning-based single image depth estimation (SIDE) model relies heavily on shading and contextual cues in the image. While this simplicity is attractive, it is necessary to train such models on large and varied datasets, which are difficult to capture. It has been shown that using embeddings from pre-trained foundational models, such as CLIP, improves zero-shot transfer in several applications. Taking inspiration from this, in our paper we explore the use of global image priors generated from a pre-trained ViT model to provide more detailed contextual information. We argue that the embedding vector from a ViT model, pre-trained on a large dataset, captures more relevant information for SIDE than the usual route of generating pseudo image captions followed by CLIP-based text embeddings. Based on this idea, we propose a new SIDE model using a diffusion backbone which is conditioned on ViT embeddings. Our proposed design establishes a new state-of-the-art (SOTA) for SIDE on the NYUv2 dataset, achieving an Abs Rel error of 0.059 (14% improvement) compared to 0.069 by the current SOTA (VPD). On the KITTI dataset, it achieves a Sq Rel error of 0.139 (2% improvement) compared to 0.142 by the current SOTA (GEDepth). For zero-shot transfer with a model trained on NYUv2, we report mean relative improvement of (20%, 23%, 81%, 25%) over NeWCRFs on (Sun-RGBD, iBims1, DIODE, HyperSim) datasets, compared to (16%, 18%, 45%, 9%) by ZoeDepth. The project page is available at
Computer Vision and Pattern Recognition, Artificial Intelligence, Machine Learning
What problem does this paper attempt to address?
This paper addresses how to use global image priors effectively in single-image depth estimation (SIDE), so that a model performs and generalizes better across datasets. Specifically:

1. **Problem Background**:
   - Single-image depth estimation predicts a depth value for every pixel of a single RGB image. It is a fundamental computer vision problem with applications in robotics, autonomous driving, and augmented reality.
   - Without parallax cues, learning-based SIDE models rely on shading and contextual cues in the image. Such models therefore have to be trained on large and varied datasets, which are difficult to capture.
2. **Limitations of Existing Methods**:
   - Existing methods usually generate pseudo image captions and then use a text embedding model such as CLIP to provide semantic context. Text descriptions tend to cover only the large, salient objects in an image and miss finer scene details.
   - Embeddings from pre-trained foundation models (such as CLIP) help zero-shot transfer, but generating pseudo-captions and then embedding them may not be the most effective way to use those models.
3. **New Method Proposed in the Paper**:
   - The paper proposes a new SIDE model named ECoDepth. It uses a diffusion model as the backbone and conditions it on global image priors extracted by a ViT model.
   - The ViT model is pre-trained on a large dataset, and the authors argue that its embedding vector captures more information relevant to SIDE than the pseudo-caption route.
4. **Main Contributions**:
   - A new SIDE framework based on a conditional diffusion model, which uses ViT embeddings to provide richer semantic context and achieves new state-of-the-art (SOTA) performance.
   - On NYU Depth v2, the Abs Rel error drops to 0.059 from 0.069 for VPD (a 14% improvement). On KITTI, the Sq Rel error drops to 0.139 from 0.142 for GEDepth (a 2% improvement).
   - In zero-shot transfer, the model trained only on NYU Depth v2 achieves mean relative improvements of 20%, 23%, 81%, and 25% over NeWCRFs on four unseen datasets (Sun-RGBD, iBims1, DIODE, HyperSim), compared to 16%, 18%, 45%, and 9% for ZoeDepth.

In summary, the paper aims to improve the performance and generalization of single-image depth estimation by conditioning a diffusion backbone on a more effective global image prior, namely ViT embeddings.
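The 14% and 2% figures can be checked from the error values quoted in the abstract. The sketch below is illustrative only: it assumes the standard benchmark definitions of Abs Rel (mean of |d − d*| / d*) and Sq Rel (mean of (d − d*)² / d*), which the text here does not spell out, and the function names are hypothetical rather than taken from the authors' code.

```python
def abs_rel(pred, gt):
    """Mean absolute relative error over pixels with valid (positive) ground truth.

    Assumed definition: mean(|pred - gt| / gt).
    """
    pairs = [(p, g) for p, g in zip(pred, gt) if g > 0]
    return sum(abs(p - g) / g for p, g in pairs) / len(pairs)


def sq_rel(pred, gt):
    """Mean squared relative error over pixels with valid (positive) ground truth.

    Assumed definition: mean((pred - gt)^2 / gt).
    """
    pairs = [(p, g) for p, g in zip(pred, gt) if g > 0]
    return sum((p - g) ** 2 / g for p, g in pairs) / len(pairs)


def relative_improvement(baseline, ours):
    """Fractional error reduction relative to a baseline (lower error is better)."""
    return (baseline - ours) / baseline


# Error values as quoted in the abstract.
nyu_gain = relative_improvement(baseline=0.069, ours=0.059)    # Abs Rel, vs. VPD
kitti_gain = relative_improvement(baseline=0.142, ours=0.139)  # Sq Rel, vs. GEDepth

print(f"NYUv2 Abs Rel improvement: {nyu_gain:.1%}")   # 14.5%
print(f"KITTI Sq Rel improvement: {kitti_gain:.1%}")  # 2.1%
```

Both results are consistent with the rounded 14% and 2% stated above. The two metric functions take flat sequences of predicted and ground-truth depths; a real evaluation would also apply each benchmark's depth cap and crop, which this sketch omits.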