Abstract: Monocular depth estimation from a single remote sensing image has become a focal point of both remote sensing and computer vision research, and it is crucial for tasks such as 3D reconstruction and target instance segmentation. Because it does not require multiple views as references, monocular depth estimation saves time and improves efficiency. However, remote sensing images are complex, contain occlusions, and have uneven depth distributions, so few monocular depth estimation methods currently exist for them. This paper proposes an approach to remote sensing monocular depth estimation that integrates attention mechanisms while considering both global and local feature information. Taking a single remote sensing image as input, the method predicts the depth of the corresponding area end to end. In the encoder, the method employs a densely connected convolutional network (DenseNet) feature extraction module with efficient channel attention (ECA), which improves the capture of local information and details in remote sensing images. In the decoder, the method uses a dense atrous spatial pyramid pooling (DenseASPP) module with channel and spatial attention modules, which mitigates information loss and strengthens the relationship between the target's position and the background in the image. Additionally, weighted global guidance plane modules fuse comprehensive features from different scales and receptive fields to produce the final depth prediction. Extensive experiments on the publicly available WHU-OMVS dataset demonstrate that the method yields better depth results in both qualitative and quantitative evaluations.
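
The efficient channel attention (ECA) step named in the abstract can be illustrated with a minimal, framework-free sketch. This is not the paper's implementation: it follows the commonly described ECA recipe (global average pooling per channel, a 1D convolution across the channel axis, a sigmoid gate, then channel-wise rescaling). The function names `eca_weights` and `apply_eca` and the kernel values are hypothetical, and in practice the kernel would be learned inside a deep learning framework.

```python
import math


def eca_weights(feature_map, kernel):
    """Return one attention weight per channel.

    feature_map: list of C channels, each an H x W nested list of floats.
    kernel: odd-length list of 1D convolution weights (hypothetical values;
            in a trained network these would be learned).
    """
    # Global average pooling: one scalar descriptor per channel.
    pooled = [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in feature_map
    ]
    # Zero-pad so the 1D convolution keeps the channel count unchanged.
    pad = len(kernel) // 2
    padded = [0.0] * pad + pooled + [0.0] * pad
    # Slide the kernel along the channel axis (local cross-channel interaction).
    logits = [
        sum(kernel[j] * padded[i + j] for j in range(len(kernel)))
        for i in range(len(pooled))
    ]
    # Sigmoid gate maps each logit to a weight in (0, 1).
    return [1.0 / (1.0 + math.exp(-z)) for z in logits]


def apply_eca(feature_map, kernel):
    """Rescale every channel by its attention weight."""
    weights = eca_weights(feature_map, kernel)
    return [
        [[w * value for value in row] for row in channel]
        for channel, w in zip(feature_map, weights)
    ]


# Toy input: 3 channels of size 2 x 2 with constant values 0, 1, and 2.
toy_features = [[[float(c)] * 2 for _ in range(2)] for c in range(3)]
toy_kernel = [0.0, 1.0, 0.0]  # assumed weights; passes each pooled value through
toy_output = apply_eca(toy_features, toy_kernel)
```

With the pass-through kernel above, each channel is gated by the sigmoid of its own mean. A kernel with nonzero neighbours would mix information from adjacent channels, which is the point of the 1D convolution in ECA.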

Monocular depth estimation via cross-spectral stereo information fusion

A Depth Estimation Framework Based on Unsupervised Learning and Cross-Modal Translation

Monocular Depth Estimation Based on Unsupervised Learning

A Robust Monocular Depth Estimation Framework Based on Light-Weight ERF-Pspnet for Day-Night Driving Scenes

Reliable Fusion of ToF and Stereo Data Based on Joint Depth Filter

Expanding Sparse LiDAR Depth and Guiding Stereo Matching for Robust Dense Depth Estimation

Depth Estimation by Combining Binocular Stereo and Monocular Structured-Light

Cross-spectral stereo matching for facial disparity estimation in the dark

Adaptive Stereo Depth Estimation with Multi-Spectral Images Across All Lighting Conditions

Learning Monocular Depth by Distilling Cross-domain Stereo Networks

Unsupervised Cross-Spectrum Depth Estimation by Visible-Light and Thermal Cameras

Self-Supervised Monocular Depth Estimation Based on High-Order Spatial Interactions

Unveiling the Depths: A Multi-Modal Fusion Framework for Challenging Scenarios

FusionDepth: Complement Self-Supervised Monocular Depth Estimation with Cost Volume

Stereo Matching by Self-supervision of Multiscopic Vision

Self-Supervised Monocular Depth Estimation with Self-Reference Distillation and Disparity Offset Refinement

Deep eyes: Joint depth inference using monocular and binocular cues

Attention-Based Monocular Depth Estimation Considering Global and Local Information in Remote Sensing Images

Holistic and Contextual Evidential Stereo-LiDAR Fusion for Depth Estimation

Unsupervised Visible-light Images Guided Cross-Spectrum Depth Estimation from Dual-Modality Cameras

Edge-preserving photometric stereo via depth fusion