PatchRefiner: Leveraging Synthetic Data for Real-Domain High-Resolution Monocular Metric Depth Estimation

Zhenyu Li, Shariq Farooq Bhat, Peter Wonka
2024-06-11
Abstract: This paper introduces PatchRefiner, an advanced framework for metric single image depth estimation aimed at high-resolution real-domain inputs. While depth estimation is crucial for applications such as autonomous driving, 3D generative modeling, and 3D reconstruction, achieving accurate high-resolution depth in real-world scenarios is challenging due to the constraints of existing architectures and the scarcity of detailed real-world depth data. PatchRefiner adopts a tile-based methodology, reconceptualizing high-resolution depth estimation as a refinement process, which results in notable performance enhancements. Utilizing a pseudo-labeling strategy that leverages synthetic data, PatchRefiner incorporates a Detail and Scale Disentangling (DSD) loss to enhance detail capture while maintaining scale accuracy, thus facilitating the effective transfer of knowledge from synthetic to real-world data. Our extensive evaluations demonstrate PatchRefiner's superior performance, significantly outperforming existing benchmarks on the Unreal4KStereo dataset by 18.1% in terms of the root mean squared error (RMSE) and showing marked improvements in detail accuracy and consistent scale estimation on diverse real-world datasets like CityScape, ScanNet++, and ETH3D.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses metric depth estimation from a single image in high-resolution real-world scenes. Specifically, the authors identify two challenges that existing methods face on high-resolution real-domain inputs:

1. **Resolution limitations of existing architectures**: Most state-of-the-art depth-estimation architectures are constrained by memory and computational resources when processing high-resolution images.
2. **Scarcity of high-quality real-world depth data**: High-resolution real-world depth datasets are very scarce. Existing datasets are usually low resolution and often lack ground-truth data, especially near object boundaries.

To address these problems, the authors propose a new framework named PatchRefiner, which improves high-resolution depth estimation in three ways:

- **Tile-based method**: Reconceptualizes high-resolution depth estimation as a refinement process and processes high-resolution inputs tile by tile.
- **Pseudo-label strategy**: Uses synthetic data to generate pseudo-labels, compensating for the scarcity of detailed real-world data.
- **Detail and Scale Disentangling (DSD) loss**: Introduces a new loss function that combines rank supervision and scale invariance, transferring knowledge from synthetic data to real-world data and improving detail capture while maintaining scale accuracy.

With these improvements, PatchRefiner significantly outperforms existing methods on multiple benchmark datasets. On the synthetic Unreal4KStereo dataset, it reduces RMSE by 18.1% and REL by 15.7%. It also performs well on real-world datasets such as CityScape, ScanNet++, and ETH3D, with marked improvements in boundary-detail accuracy and consistency of scale estimation.
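To make the idea of combining rank supervision with scale invariance more concrete, here is a minimal sketch in plain Python. It is not the paper's DSD formulation: the ranking term is a generic pairwise ordinal loss, the scale-invariant term is the familiar log-space variance loss, and every name and default (`ranking_loss`, `scale_invariant_log_loss`, `tol`, `num_pairs`) is a hypothetical choice made for illustration. Depth maps are assumed to be flattened into lists of positive floats.

```python
import math
import random


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def scale_invariant_log_loss(pred, target, lam=1.0):
    """Log-space loss; with lam=1 a global scale factor on pred has no effect."""
    diffs = [math.log(p) - math.log(t) for p, t in zip(pred, target)]
    n = len(diffs)
    mean_sq = sum(d * d for d in diffs) / n
    mean = sum(diffs) / n
    return mean_sq - lam * mean * mean


def ranking_loss(pred, pseudo, num_pairs=256, tol=0.02, seed=0):
    """Pairwise ordinal loss: only the depth ORDER in `pseudo` is used."""
    rng = random.Random(seed)
    n = len(pred)
    total = 0.0
    for _ in range(num_pairs):
        i, j = rng.randrange(n), rng.randrange(n)
        ratio = pseudo[i] / pseudo[j]
        if ratio > 1.0 + tol:
            label = 1    # pseudo-label says pixel i is farther than j
        elif ratio < 1.0 / (1.0 + tol):
            label = -1   # pseudo-label says pixel i is closer than j
        else:
            label = 0    # treated as roughly equal depth
        gap = pred[i] - pred[j]
        # Equal pairs are pulled together; ordered pairs are pushed apart.
        total += gap * gap if label == 0 else softplus(-label * gap)
    return total / num_pairs


# Hypothetical toy values: the pseudo-label has the right ordering
# but the wrong metric scale.
pred = [1.0, 2.1, 2.9, 4.2]
pseudo = [0.5, 1.0, 1.5, 2.0]
print(ranking_loss(pred, pseudo))
print(scale_invariant_log_loss(pred, pseudo))
```

In this sketch neither term depends on the absolute scale of the reference depth, so a pseudo-label can shape fine structure without dictating metric scale; the scale itself would have to come from a separate term supervised by real ground truth.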
In summary, this paper tackles the challenges of high-resolution real-domain monocular depth estimation through a new framework design and loss function, with the goal of improving model performance in real-world scenarios.
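The RMSE and REL figures quoted above are reported as relative reductions against a baseline. The sketch below uses the standard definitions of root mean squared error and absolute relative error; it assumes depth maps flattened into lists of positive floats, and the function names and toy values are hypothetical rather than taken from the paper's evaluation code.

```python
import math


def rmse(pred, gt):
    """Root mean squared error between predicted and ground-truth depth."""
    return math.sqrt(sum((p - g) ** 2 for p, g in zip(pred, gt)) / len(gt))


def abs_rel(pred, gt):
    """Absolute relative error (REL): mean of |pred - gt| / gt."""
    return sum(abs(p - g) / g for p, g in zip(pred, gt)) / len(gt)


def relative_reduction(baseline, improved):
    """Fraction by which `improved` lowers an error metric versus `baseline`."""
    return (baseline - improved) / baseline


# Hypothetical toy depths in meters, not results from the paper.
gt = [2.0, 4.0, 8.0]
baseline_pred = [2.5, 3.0, 9.0]
refined_pred = [2.2, 3.6, 8.4]

for name, metric in (("RMSE", rmse), ("REL", abs_rel)):
    base, ours = metric(baseline_pred, gt), metric(refined_pred, gt)
    print(f"{name}: {base:.3f} -> {ours:.3f} "
          f"({100 * relative_reduction(base, ours):.1f}% lower)")
```

A statement such as "RMSE reduced by 18.1%" therefore describes the ratio computed by `relative_reduction`, not an absolute difference in meters.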