BetterDepth: Plug-and-Play Diffusion Refiner for Zero-Shot Monocular Depth Estimation

Xiang Zhang, Bingxin Ke, Hayko Riemenschneider, Nando Metzger, Anton Obukhov, Markus Gross, Konrad Schindler, Christopher Schroers
2024-07-25
Abstract: By training over large-scale datasets, zero-shot monocular depth estimation (MDE) methods show robust performance in the wild but often suffer from insufficiently precise details. Although recent diffusion-based MDE approaches exhibit appealing detail extraction ability, they still struggle in geometrically challenging scenes due to the difficulty of gaining robust geometric priors from diverse datasets. To leverage the complementary merits of both worlds, we propose BetterDepth to efficiently achieve geometrically correct affine-invariant MDE performance while capturing fine-grained details. Specifically, BetterDepth is a conditional diffusion-based refiner that takes the prediction from pre-trained MDE models as depth conditioning, in which the global depth context is well-captured, and iteratively refines details based on the input image. For the training of such a refiner, we propose global pre-alignment and local patch masking methods to ensure the faithfulness of BetterDepth to depth conditioning while learning to capture fine-grained scene details. By efficient training on small-scale synthetic datasets, BetterDepth achieves state-of-the-art zero-shot MDE performance on diverse public datasets and in-the-wild scenes. Moreover, BetterDepth can improve the performance of other MDE models in a plug-and-play manner without additional re-training.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses two main issues in monocular depth estimation (MDE):

1. **Insufficient detail**: Zero-shot MDE methods trained on large-scale datasets are robust in the wild, but their predictions often lack fine-grained detail.
2. **Weak geometric priors**: Diffusion-based MDE methods recover impressive detail, but they struggle in geometrically challenging scenes because robust geometric priors are hard to learn from diverse datasets.

To combine the strengths of both families, the authors propose BetterDepth, a conditional diffusion-based refiner. It takes the prediction of a pre-trained MDE model as depth conditioning, which already captures the global depth context, and iteratively refines the details of that prediction guided by the input image. To train the refiner, the authors introduce a global pre-alignment method and a local patch masking strategy, which keep BetterDepth faithful to the depth conditioning while it learns to capture fine-grained scene details. Trained efficiently on small-scale synthetic datasets, BetterDepth achieves state-of-the-art zero-shot, affine-invariant MDE performance on diverse public datasets and in-the-wild scenes, and it can improve other MDE models in a plug-and-play manner without additional retraining.
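The summary does not spell out how the two training strategies operate. One plausible reading, sketched here purely as an illustration, is that global pre-alignment fits a scale and shift that bring the depth conditioning onto the training label (the usual least-squares step for affine-invariant depth), and that local patch masking then restricts supervision to patches where the aligned conditioning and the label already agree. The function names `align_scale_shift` and `patch_mask`, the patch size, and the threshold are all assumptions made for this sketch, not details taken from the paper.

```python
import numpy as np

def align_scale_shift(cond, target):
    """Least-squares scale s and shift t such that s * cond + t approximates target."""
    # Design matrix with one column for the scale and one for the shift.
    A = np.stack([cond.ravel(), np.ones(cond.size)], axis=1)
    (s, t), *_ = np.linalg.lstsq(A, target.ravel(), rcond=None)
    return s * cond + t

def patch_mask(aligned, target, patch=8, thresh=0.05):
    """Boolean mask that is True on patches whose mean absolute error is below thresh.

    The patch size and threshold are hypothetical values chosen for illustration.
    """
    h, w = target.shape
    mask = np.zeros((h, w), dtype=bool)
    for y in range(0, h, patch):
        for x in range(0, w, patch):
            diff = aligned[y:y + patch, x:x + patch] - target[y:y + patch, x:x + patch]
            mask[y:y + patch, x:x + patch] = np.abs(diff).mean() < thresh
    return mask

# Toy usage: the conditioning is an affine-distorted copy of the label.
rng = np.random.default_rng(0)
label = rng.uniform(0.0, 1.0, size=(32, 32))
conditioning = 0.5 * label + 0.1
aligned = align_scale_shift(conditioning, label)
mask = patch_mask(aligned, label)  # patches that would contribute to the loss
```

In a training loop, a mask like this would typically multiply the per-pixel loss so that regions where the conditioning disagrees with the label do not push the refiner away from its conditioning. Whether BetterDepth applies the mask in exactly this way is not stated in the text above.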