Abstract: In the field of monocular depth estimation (MDE), many models with excellent zero-shot performance in general scenes have emerged recently. However, these methods often fail to predict depth correctly on non-Lambertian surfaces, such as transparent or mirror (ToM) surfaces, because of the distinctive reflective properties of these regions. Previous methods rely on externally provided ToM masks and aim to obtain correct depth maps by directly in-painting the RGB images. Such methods depend heavily on the accuracy of the additional input masks, and their use of random colors during in-painting makes them insufficiently robust. We instead aim to incrementally enable the baseline model to directly learn the distinctive characteristics of non-Lambertian surface regions for depth estimation through a well-designed training framework. To this end, we propose non-Lambertian surface regional guidance, which constrains the predictions of the MDE model in the gradient domain to enhance its robustness. Noting the significant impact of lighting on this task, we employ random tone-mapping augmentation during training so that the network predicts correct results for inputs under varying lighting. Additionally, we propose a novel, optional lighting fusion module that uses Variational Autoencoders (VAEs) to fuse multiple images into the most advantageous input RGB image for depth estimation when multi-exposure images are available. Compared with Depth Anything V2, our method improves accuracy by 33.39% and 5.21% in zero-shot testing on the Booster and Mirror3D datasets for non-Lambertian surfaces, respectively. Its state-of-the-art δ1.05 score of 90.75 within the ToM regions of the TRICKY2024 competition test set further demonstrates the effectiveness of our approach.
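
The abstract does not spell out the exact form of the non-Lambertian surface regional guidance. As a minimal sketch, assuming the guidance is an L1 penalty on the difference between forward-difference depth gradients of the prediction and a target depth map, restricted to pixels inside the ToM mask, it could look like the following. The function name `masked_gradient_loss`, the L1 choice, and the plain-Python grid representation are illustrative assumptions, not the authors' formulation.

```python
from typing import List

Grid = List[List[float]]
Mask = List[List[bool]]


def masked_gradient_loss(pred: Grid, target: Grid, mask: Mask) -> float:
    """Mean L1 gap between forward-difference depth gradients inside a mask.

    Hypothetical sketch: a horizontal or vertical gradient term is counted
    only when both pixels it spans lie inside `mask`, so differences that
    cross the region boundary are ignored. Returns 0.0 if no pairs qualify.
    """
    if not pred:
        return 0.0
    height, width = len(pred), len(pred[0])
    total, count = 0.0, 0
    for i in range(height):
        for j in range(width):
            if not mask[i][j]:
                continue
            # Horizontal gradient between (i, j) and (i, j + 1).
            if j + 1 < width and mask[i][j + 1]:
                d_pred = pred[i][j + 1] - pred[i][j]
                d_target = target[i][j + 1] - target[i][j]
                total += abs(d_pred - d_target)
                count += 1
            # Vertical gradient between (i, j) and (i + 1, j).
            if i + 1 < height and mask[i + 1][j]:
                d_pred = pred[i + 1][j] - pred[i][j]
                d_target = target[i + 1][j] - target[i][j]
                total += abs(d_pred - d_target)
                count += 1
    return total / count if count else 0.0


if __name__ == "__main__":
    flat = [[0.0, 0.0], [0.0, 0.0]]
    ramp = [[0.0, 1.0], [0.0, 1.0]]
    full = [[True, True], [True, True]]
    print(masked_gradient_loss(flat, ramp, full))  # 0.5
```

Because the loss compares only finite differences, adding a constant offset to the prediction leaves it unchanged; a scale mismatch between prediction and target, however, is still penalized.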
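
The abstract likewise does not specify the tone curves used for random tone-mapping augmentation. The sketch below assumes a per-image random exposure gain, followed by Reinhard's global operator v / (1 + v) and a random display gamma, applied to the linear intensities of one image channel. The function name `random_tone_map` and both sampling ranges are hypothetical; only the RGB input would be altered, while the depth label stays untouched.

```python
import random
from typing import List, Tuple


def random_tone_map(
    pixels: List[float],
    rng: random.Random,
    exposure_range: Tuple[float, float] = (0.5, 2.0),  # hypothetical range
    gamma_range: Tuple[float, float] = (1.8, 2.6),     # hypothetical range
) -> List[float]:
    """Re-render linear intensities under one randomly sampled tone curve.

    One exposure gain and one gamma are drawn per call and applied to every
    pixel, imitating the same scene captured under different lighting.
    Outputs fall in [0, 1].
    """
    exposure = rng.uniform(*exposure_range)
    gamma = rng.uniform(*gamma_range)
    mapped = []
    for x in pixels:
        v = max(x, 0.0) * exposure          # brighter or darker capture
        v = v / (1.0 + v)                   # Reinhard operator: [0, inf) -> [0, 1)
        mapped.append(v ** (1.0 / gamma))   # display gamma encoding
    return mapped


if __name__ == "__main__":
    rng = random.Random(0)
    image = [0.0, 0.1, 0.5, 1.0, 4.0]  # toy linear intensities of one channel
    print(random_tone_map(image, rng))
```

Every step is monotonic, so the relative ordering of pixel intensities is preserved while their absolute values change; the network therefore sees the same geometry under different apparent lighting.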
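
The δ1.05 score quoted in the abstract is a threshold accuracy: the percentage of evaluated pixels whose ratio max(d̂/d, d/d̂) between predicted depth d̂ and ground truth d stays below 1.05. A minimal sketch of computing it within a ToM mask follows. It assumes the prediction has already been aligned to the ground-truth scale and that pixels with non-positive ground truth are invalid; the function name `delta_accuracy` is hypothetical, and this is not the official TRICKY2024 evaluation code.

```python
from typing import Sequence


def delta_accuracy(
    pred: Sequence[float],
    gt: Sequence[float],
    mask: Sequence[bool],
    threshold: float = 1.05,
) -> float:
    """Percentage of masked, valid pixels with max(p / g, g / p) < threshold.

    `pred` is assumed to be already aligned to the ground-truth scale.
    Pixels outside `mask` or with non-positive ground truth are skipped;
    a non-positive prediction on a valid pixel counts as a miss.
    """
    hits, total = 0, 0
    for p, g, m in zip(pred, gt, mask):
        if not m or g <= 0:
            continue  # outside the ToM region or no valid ground truth
        total += 1
        if p > 0 and max(p / g, g / p) < threshold:
            hits += 1
    if total == 0:
        raise ValueError("mask selects no pixels with valid ground truth")
    return 100.0 * hits / total


if __name__ == "__main__":
    pred = [1.00, 2.20, 3.00, 4.00]
    gt = [1.02, 2.00, 3.50, 9.99]
    tom_mask = [True, True, True, False]  # last pixel lies outside the ToM region
    print(round(delta_accuracy(pred, gt, tom_mask), 2))  # 33.33
```

Only the first masked pixel (ratio 1.02) passes the 1.05 threshold, so one hit out of three valid pixels gives 33.33%.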

Illumination Insensitive Monocular Depth Estimation Based on Scene Object Attention and Depth Map Fusion

A Depth Estimation Framework Based on Unsupervised Learning and Cross-Modal Translation

MFF-Net: Towards Efficient Monocular Depth Completion With Multi-Modal Feature Fusion

A Robust Monocular Depth Estimation Framework Based on Light-Weight ERF-Pspnet for Day-Night Driving Scenes

Monocular Depth Estimation Based on Unsupervised Learning

Delving into Multi-illumination Monocular Depth Estimation: A New Dataset and Method

Monocular Depth Estimation Based on Dilated Convolutions and Feature Fusion

Unveiling the Depths: A Multi-Modal Fusion Framework for Challenging Scenarios

Adaptive Semantic Fusion Framework for Unsupervised Monocular Depth Estimation

MBUDepthNet: Real-Time Unsupervised Monocular Depth Estimation Method for Outdoor Scenes

Self-supervised Monocular Depth Estimation with Uncertainty-aware Feature Enhancement and Depth Fusion

MDSNet: self-supervised monocular depth estimation for video sequences using self-attention and threshold mask

Towards Robust Monocular Depth Estimation in Non-Lambertian Surfaces

PCTDepth: Exploiting Parallel CNNs and Transformer via Dual Attention for Monocular Depth Estimation

MFCS-Depth: an Economical Self-Supervised Monocular Depth Estimation Based on Multi-Scale Fusion and Channel Separation Attention

Unsupervised Monocular Depth Estimation Based on Dual Attention Mechanism and Depth-Aware Loss

Multi-feature fusion enhanced monocular depth estimation with boundary awareness

AggNet for Self-supervised Monocular Depth Estimation: Go an Aggressive Step Further

GlobalDepth: Global-Aware Attention Model for Unsupervised Monocular Depth Estimation

Diversity Matters: Fully Exploiting Depth Clues for Reliable Monocular 3D Object Detection

Depth Estimation Matters Most: Improving Per-Object Depth Estimation for Monocular 3D Detection and Tracking