Abstract: Monocular depth estimation (MDE) aims to predict pixel-level dense depth maps from a single RGB image. Recent approaches mainly rely on encoder–decoder architectures to capture and process multi-scale features. However, they usually adopt heavier networks, trading higher computational cost for high-quality depth maps. In this paper, we propose a novel enriched multi-path vision transformer feature interaction network with an encoder–decoder architecture, denoted as EDFIDepth, which seeks a balance between computational cost and performance rather than pursuing the highest accuracy or an extremely lightweight model. Specifically, an encoder called MPViT-D, incorporating a multi-path vision transformer and a deep convolution module, is introduced to extract diverse features with both fine and coarse details at the same feature level with fewer parameters. Subsequently, we propose a lightweight decoder comprising two effective modules to establish multi-scale feature interaction: an encoder–decoder cross-feature matching (ED-CFM) module and a channel-level feature fusion (CLFF) module. The ED-CFM module establishes connections between encoder and decoder features through a dual-path structure, in which a cross-attention mechanism enhances the relevance of multi-scale complementary depth information. Meanwhile, the CLFF module uses a channel attention mechanism to further fuse crucial depth information within the channels, thereby improving the accuracy of depth estimation. Extensive experiments on the indoor dataset NYUv2 and the outdoor dataset KITTI demonstrate that our method achieves results comparable to the state of the art (SOTA) while significantly reducing the number of trainable parameters. Our code is available at https://github.com/Zhangmg123/EDFIDEpth.
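To make the idea of channel attention concrete, the following is a minimal sketch of a generic squeeze-and-excitation style gate applied to the sum of an encoder and a decoder feature map. It is not the CLFF module itself: the function name `channel_attention_fuse`, the element-wise sum, and the unlearned sigmoid gate are all assumptions chosen for illustration, and a real module would typically learn the gating weights.

```python
import math


def channel_attention_fuse(enc, dec):
    """Fuse two feature maps with a per-channel gate (illustrative only).

    enc, dec: lists of C channels, each channel an H x W nested list of floats.
    Hypothetical helper; the gate below has no learned weights.
    """
    fused = []
    for enc_ch, dec_ch in zip(enc, dec):
        # Combine encoder and decoder features element-wise.
        summed = [[e + d for e, d in zip(er, dr)]
                  for er, dr in zip(enc_ch, dec_ch)]
        # Squeeze: global average pooling over the spatial dimensions.
        count = sum(len(row) for row in summed)
        mean = sum(sum(row) for row in summed) / count
        # Excite: sigmoid of the pooled value gives a channel weight in (0, 1).
        gate = 1.0 / (1.0 + math.exp(-mean))
        # Rescale every position in the channel by its gate.
        fused.append([[gate * v for v in row] for row in summed])
    return fused


# Two channels of size 2 x 2: one with strong activations, one empty.
enc = [[[1.0, 1.0], [1.0, 1.0]], [[0.0, 0.0], [0.0, 0.0]]]
dec = [[[1.0, 1.0], [1.0, 1.0]], [[0.0, 0.0], [0.0, 0.0]]]
out = channel_attention_fuse(enc, dec)
```

In this toy input the first channel receives a larger gate than the second, which is the basic behaviour a channel attention mechanism relies on: channels with stronger pooled responses are emphasised during fusion.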

EDFIDepth: enriched multi-path vision transformer feature interaction networks for monocular depth estimation