Abstract: Monocular depth estimation from a single remote sensing image has become a focal point of both remote sensing and computer vision research, and it is crucial for tasks such as 3D reconstruction and target instance segmentation. Because it does not require multiple views as references, monocular depth estimation saves time and improves efficiency. However, remote sensing images are complex, contain occlusions, and have uneven depth distributions, so few monocular depth estimation methods currently exist for them. This paper proposes an approach to remote sensing monocular depth estimation that integrates an attention mechanism while considering both global and local feature information. Taking a single remote sensing image as input, the method produces a depth estimate for the corresponding area end to end. In the encoder, the proposed method employs a dense convolutional network (DenseNet) feature extraction module with efficient channel attention (ECA), which enhances the capture of local information and details in remote sensing images. In the decoder, this paper proposes a dense atrous spatial pyramid pooling (DenseASPP) module with channel and spatial attention modules, which mitigates information loss and strengthens the relationship between the target's position and the background in the image. Additionally, weighted global guidance plane modules are introduced to fuse features from different scales and receptive fields, from which the final monocular depth of the remote sensing image is predicted. Extensive experiments on the publicly available WHU-OMVS dataset demonstrate that our method yields better depth results both qualitatively and quantitatively.
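
The efficient channel attention (ECA) step named in the abstract can be illustrated with a minimal pure-Python sketch. It follows the general ECA recipe (global average pooling per channel, a 1D convolution across neighbouring channels, a sigmoid gate, then channel-wise rescaling). It is not the authors' implementation: the function names `eca_kernel_size` and `eca`, the nested-list tensor layout, and the kernel weights passed in are all assumptions made for illustration, and the abstract does not say where ECA sits inside the DenseNet encoder.

```python
import math


def eca_kernel_size(channels, gamma=2, b=1):
    """Adaptive 1D kernel size used by ECA: an odd integer that grows with log2(C).

    gamma and b are the commonly used defaults; treat them as assumptions here.
    """
    t = int(abs((math.log2(channels) + b) / gamma))
    return t if t % 2 == 1 else t + 1


def eca(feature_map, weights):
    """Apply channel attention to a feature map.

    feature_map: list of C channels, each an H x W nested list of floats.
    weights: odd-length 1D kernel (hypothetical stand-in for learned weights).
    Returns (rescaled feature map, per-channel gates).
    """
    num_channels = len(feature_map)

    # 1) Global average pooling: one scalar descriptor per channel.
    pooled = [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in feature_map
    ]

    # 2) 1D convolution across the channel axis with zero padding,
    #    so each channel interacts only with its k nearest neighbours.
    k = len(weights)
    pad = k // 2
    padded = [0.0] * pad + pooled + [0.0] * pad
    logits = [
        sum(weights[j] * padded[i + j] for j in range(k))
        for i in range(num_channels)
    ]

    # 3) Sigmoid gate per channel.
    gates = [1.0 / (1.0 + math.exp(-z)) for z in logits]

    # 4) Rescale every value of a channel by that channel's gate.
    out = [
        [[value * gate for value in row] for row in channel]
        for channel, gate in zip(feature_map, gates)
    ]
    return out, gates
```

Unlike squeeze-and-excitation blocks, this form of attention avoids a dimensionality-reducing bottleneck, which is why it adds only `k` parameters per attention layer.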

Attention-Based Monocular Depth Estimation Considering Global and Local Information in Remote Sensing Images
