Abstract: Monocular depth estimation measures distance relative to a camera, a crucial task for applications such as robot navigation and self-driving. Traditional frame-based methods suffer from performance drops due to limited dynamic range and motion blur. Recent works therefore leverage event cameras to complement or guide the frame modality via frame-event feature fusion. However, event streams are spatially sparse, leaving some areas unperceived, especially regions with marginal light changes. As a result, direct fusion methods, e.g., RAMNet, often ignore the contribution of the most confident regions of each modality. This leads to structural ambiguity in the modality fusion process and degrades depth estimation performance. In this paper, we propose a novel Spatial Reliability-oriented Fusion Network (SRFNet) that can estimate depth with fine-grained structure in both daytime and nighttime scenes. Our method consists of two key technical components. First, we propose an attention-based interactive fusion (AIF) module that applies spatial priors of events and frames as the initial masks and learns the consensus regions to guide the inter-modal feature fusion. The fused features are then fed back to enhance the frame and event feature learning. Meanwhile, the module uses an output head to generate a fused mask, which is iteratively updated to learn consensual spatial priors. Second, we propose the Reliability-oriented Depth Refinement (RDR) module to estimate dense depth with fine-grained structure based on the fused features and masks. We evaluate our method on synthetic and real-world datasets; the results show that, even without pretraining, it outperforms prior methods, e.g., RAMNet, especially in night scenes. Project homepage: https://vlislab22.github.io/SRFNet
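
To make the idea of reliability-guided fusion concrete, the following is a minimal NumPy sketch, not the SRFNet implementation. It assumes two hand-crafted spatial priors (a pixel is reliable for events if it fired at least once, and reliable for the frame if it is neither under- nor over-exposed) and fuses features by a per-pixel mask-weighted average. The function names, thresholds, and the fallback to a plain average are all hypothetical choices made for illustration; the abstract describes learned attention and an iteratively updated mask, which this sketch does not model.

```python
import numpy as np

def event_mask(event_count, min_events=1):
    """Hypothetical prior: a pixel is reliable for events if it fired enough times."""
    return (event_count >= min_events).astype(np.float32)

def frame_mask(gray, low=0.05, high=0.95):
    """Hypothetical prior: a pixel is reliable for the frame if it is well exposed."""
    return ((gray > low) & (gray < high)).astype(np.float32)

def fuse(frame_feat, event_feat, m_frame, m_event, eps=1e-6):
    """Mask-weighted per-pixel average of two (C, H, W) feature maps.

    Masks have shape (H, W). Where neither modality is marked reliable,
    the sketch falls back to a plain average (an assumption, not from the paper).
    """
    w_f = m_frame[None]            # (1, H, W), broadcasts over channels
    w_e = m_event[None]
    total = w_f + w_e
    weighted = (w_f * frame_feat + w_e * event_feat) / np.maximum(total, eps)
    fallback = 0.5 * (frame_feat + event_feat)
    fused = np.where(total > 0, weighted, fallback)
    fused_mask = np.clip(total, 0.0, 1.0)[0]   # 1 where at least one modality is reliable
    return fused, fused_mask

# Toy 2x2 scene: one channel, constant features per modality.
gray = np.array([[0.5, 0.01], [0.5, 0.99]])
events = np.array([[0, 3], [2, 0]])
frame_feat = np.ones((1, 2, 2))
event_feat = 3.0 * np.ones((1, 2, 2))

fused, fused_mask = fuse(frame_feat, event_feat, frame_mask(gray), event_mask(events))
```

In this toy scene the fused value follows the frame where only the frame is reliable, follows the events where only events are reliable, and averages the two elsewhere; the fused mask marks the one pixel that neither modality covers.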

RCDformer: Transformer-based dense depth estimation by sparse radar and camera

A Depth Estimation Framework Based on Unsupervised Learning and Cross-Modal Translation

Semantic-guided Depth Completion from Monocular Images and 4D Radar Data

RadarCam-Depth: Radar-Camera Fusion for Depth Estimation with Learned Metric Scale

RCDPT: Radar-Camera fusion Dense Prediction Transformer

RaViDeep: Target Detection Based on Deep Fusion of Radar and Vision in Berthing Scenarios

Depth Estimation from Monocular Images and Sparse Radar Data

Region Deformer Networks for Unsupervised Depth Estimation from Unconstrained Monocular Videos

RCBEVDet++: Toward High-accuracy Radar-Camera Fusion 3D Perception Network

RIDERS: Radar-Infrared Depth Estimation for Robust Sensing

CaFNet: A Confidence-Driven Framework for Radar Camera Depth Estimation

Radar-Camera Pixel Depth Association for Depth Completion

RADIANT: Radar-Image Association Network for 3D Object Detection

LRCFormer: lightweight transformer based radar-camera fusion for 3D target detection

DepthFormer: Exploiting Long-range Correlation and Local Information for Accurate Monocular Depth Estimation

DELTAR: Depth Estimation from a Light-Weight ToF Sensor and RGB Image

RCBEVDet: Radar-camera Fusion in Bird's Eye View for 3D Object Detection

URCDC-Depth: Uncertainty Rectified Cross-Distillation with CutFlip for Monocular Depth Estimation

SparseFusion3D: Sparse Sensor Fusion for 3D object detection by Radar and Camera in Environmental Perception

SRFNet: Monocular Depth Estimation with Fine-grained Structure via Spatial Reliability-oriented Fusion of Frames and Events

Radar Enlighten the Dark: Enhancing Low-Visibility Perception for Automated Vehicles with Camera-Radar Fusion