Abstract: Monocular depth estimation (MDE) aims to predict pixel-level dense depth maps from a single RGB image. Recent approaches mainly rely on encoder–decoder architectures to capture and process multi-scale features. However, they usually adopt heavier networks, obtaining high-quality depth maps at the expense of computational cost. In this paper, we propose a novel enriched multi-path vision transformer feature interaction network with an encoder–decoder architecture, denoted as EDFIDepth, which seeks a balance between computational cost and performance rather than pursuing the highest accuracy or an extremely lightweight model. Specifically, an encoder called MPViT-D, which incorporates a multi-path vision transformer and a deep convolution module, is introduced to extract diverse features with both fine and coarse details at the same feature level with fewer parameters. Subsequently, we propose a lightweight decoder comprising two effective modules to establish multi-scale feature interaction: an encoder–decoder cross-feature matching (ED-CFM) module and a channel-level feature fusion (CLFF) module. The ED-CFM module establishes connections between encoder and decoder features through a dual-path structure, in which a cross-attention mechanism is deployed to enhance the relevance of multi-scale complementary depth information. Meanwhile, the CLFF module uses a channel attention mechanism to further fuse crucial depth information within the channels, thereby improving the accuracy of depth estimation. Extensive experiments on the indoor dataset NYUv2 and the outdoor dataset KITTI demonstrate that our method achieves results comparable to the state of the art (SOTA) while significantly reducing the number of trainable parameters. Our code is available at https://github.com/Zhangmg123/EDFIDEpth.
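
The two attention mechanisms named in the abstract can be illustrated with a minimal, dependency-free sketch. This is not the authors' implementation: the function names `cross_attention` and `channel_attention` are hypothetical, the learned projection layers that a real ED-CFM or CLFF module would contain are omitted, and the toy token values are invented for illustration. It only shows the general idea of decoder tokens attending to encoder tokens, followed by a per-channel gate.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product cross-attention (no learned projections; an assumption).

    queries: decoder tokens, each a list of d floats.
    keys, values: encoder tokens, one value vector per key.
    Returns one attended vector per query.
    """
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out


def channel_attention(tokens):
    """Parameter-free channel gating (an assumption, not the paper's CLFF).

    Averages each channel over all tokens, squashes the average with a
    sigmoid, and rescales that channel by the resulting gate.
    """
    n, c = len(tokens), len(tokens[0])
    means = [sum(t[j] for t in tokens) / n for j in range(c)]
    gates = [1.0 / (1.0 + math.exp(-m)) for m in means]
    return [[t[j] * gates[j] for j in range(c)] for t in tokens]


# Toy data (hypothetical): two encoder tokens and one decoder token, 2 channels.
encoder_tokens = [[1.0, 0.0], [0.0, 1.0]]
decoder_tokens = [[1.0, 0.0]]

matched = cross_attention(decoder_tokens, encoder_tokens, encoder_tokens)
fused = channel_attention(matched)
```

In this toy case the decoder token aligns with the first encoder token, so the attended vector leans toward that token's channel, and the gate then scales each channel by a factor between 0 and 1.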

The Devil is in the Edges: Monocular Depth Estimation with Edge-aware Consistency Fusion

MFF-Net: Towards Efficient Monocular Depth Completion With Multi-Modal Feature Fusion

Edge-Enhanced Dual-Stream Perception Network for Monocular Depth Estimation

MonoER - A Edge Refined Self-Supervised Monocular Depth Estimation Method

Multi-feature fusion enhanced monocular depth estimation with boundary awareness

Edge Devices Friendly Self-Supervised Monocular Depth Estimation Via Knowledge Distillation

Lightweight Monocular Depth Estimation with an Edge Guided Network

MFCS-Depth: an Economical Self-Supervised Monocular Depth Estimation Based on Multi-Scale Fusion and Channel Separation Attention

MSFNet: Multi-scale features network for monocular depth estimation

Depth Estimation from Monocular Images Using Dilated Convolution and Uncertainty Learning

AggNet for Self-supervised Monocular Depth Estimation: Go an Aggressive Step Furthe.

Monocular Depth Estimation Based on Dilated Convolutions and Feature Fusion

DepthCut: Improved Depth Edge Estimation Using Multiple Unreliable Channels

CFDepthNet: Monocular Depth Estimation Introducing Coordinate Attention and Texture Features

Adaptive Weighted Network With Edge Enhancement Module For Monocular Self-Supervised Depth Estimation

Edge Enhancement in Monocular Depth Prediction

EDFIDepth: enriched multi-path vision transformer feature interaction networks for monocular depth estimation

Evitbins: Edge-Enhanced Vision-Transformer Bins for Monocular Depth Estimation on Edge Devices

Monocular depth estimation with hierarchical fusion of dilated CNNs and soft-weighted-sum inference

Unveiling the Depths: A Multi-Modal Fusion Framework for Challenging Scenarios

Illumination Insensitive Monocular Depth Estimation Based on Scene Object Attention and Depth Map Fusion