RCDformer: Transformer-based dense depth estimation by sparse radar and camera

Xinyue Huang, Yongtao Ma, Zedong Yu, Haibo Zhao
DOI: https://doi.org/10.1016/j.neucom.2024.127668
IF: 6
2024-04-14
Neurocomputing
Abstract: Accurate depth cues are crucial for 3D perception tasks, and monocular depth estimation networks alone are no longer sufficient for realistic scenarios. Currently, the most effective approaches supplement the image with depth information from other modalities. Radar has become a popular sensor for fusion with cameras because of its low cost and all-weather operation. This paper explores how to integrate the heterogeneous data of radar point clouds and RGB images more effectively to improve depth estimation. Most previous works have not fully exploited the potential of combining these two modalities, so we propose RCDformer, a novel transformer-based network that fuses radar and camera data for dense depth estimation. Without reducing the receptive field, our approach models the contextual relationships between sensors to reduce the impact of radar noise on overall performance. In the proposed Radar-guided Multi-scale Depth Fusion (RGDF) module, the prior spatial information mapped by the Radar Feature Extractor (RFE) is embedded into the set of multi-scale hierarchical features output by the Image Feature Extractor (IFE) via a modified deformable cross-attention, guiding the image-based depth prediction. Furthermore, we find that incorporating the Radar Cross Section (RCS) attribute as an additional channel of the radar map benefits dense depth estimation and improves the overall performance of our model. We evaluate the proposed method on the nuScenes dataset, and the experimental results show that it achieves significant advantages over state-of-the-art models in most metrics.
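To make the idea of an RCS-extended radar map concrete, the following is a minimal sketch of how projected radar points might be rasterized into a sparse two-channel map (depth plus RCS) before being passed to a radar feature extractor. It is an illustration only, not the authors' implementation: the function name `build_radar_map`, the input layout, the channel order, and the nearest-point rule for pixel collisions are all assumptions, since the abstract does not specify them.

```python
import numpy as np

def build_radar_map(points_uvd, rcs, height, width):
    """Rasterize projected radar points into a sparse (2, H, W) map.

    Hypothetical helper; layout and collision rule are assumptions.
    points_uvd: (N, 3) array of pixel column u, pixel row v, depth in metres.
    rcs:        (N,) array of radar cross section values.
    Channel 0 holds depth, channel 1 holds RCS; empty pixels stay 0.
    """
    radar_map = np.zeros((2, height, width), dtype=np.float32)
    u = np.round(points_uvd[:, 0]).astype(int)
    v = np.round(points_uvd[:, 1]).astype(int)
    d = points_uvd[:, 2]

    # Keep only points that land inside the image and lie in front of the camera.
    valid = (u >= 0) & (u < width) & (v >= 0) & (v < height) & (d > 0)
    u, v, d, r = u[valid], v[valid], d[valid], rcs[valid]

    # Sort far-to-near: with repeated indices NumPy keeps the last value
    # written, so the nearest point wins when several share a pixel.
    order = np.argsort(-d)
    u, v, d, r = u[order], v[order], d[order], r[order]

    radar_map[0, v, u] = d  # depth channel
    radar_map[1, v, u] = r  # RCS channel
    return radar_map
```

The resulting array can be treated like a two-channel image. Dropping the second channel gives the depth-only radar map that the abstract implicitly compares against.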
computer science, artificial intelligence