Abstract: Monocular 3D object detection (Mono3OD) is a challenging yet cost-effective vision task in autonomous driving and mobile robotics. The lack of reliable depth information makes it extremely difficult to obtain accurate 3D positions. In recent years, center-guided monocular 3D object detectors have directly regressed the absolute depth of the object center on top of 2D detection. However, this approach relies heavily on local semantic information and ignores contextual spatial cues and global-to-local visual correlations. Moreover, visual variations in the scene inevitably lead to depth prediction errors for objects at different scales. To address these limitations, we propose a Mono3OD framework based on scene-level adaptive instance depth estimation (MonoSAID). First, the continuous depth range is discretized into multiple bins, and the distribution of bin widths is adaptively generated from scene-level contextual semantic information. Then, the correlation between global contextual semantic features and local instance semantic features is established, and the instance depth is obtained as a linear combination of the bin centers weighted by the probability distribution predicted from the local instance features. In addition, a multi-scale spatial perception attention module is designed to extract attention maps at various scales through pyramid pooling operations. This design enlarges the model’s receptive field and strengthens its multi-scale spatial perception, thereby improving its ability to model target objects. We conducted extensive experiments on the KITTI and Waymo datasets. The results show that MonoSAID effectively improves 3D detection accuracy and robustness and achieves state-of-the-art performance.
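
The bin-based depth formulation described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the softmax normalization of widths and probabilities, and the depth range `d_min`/`d_max` are assumptions chosen for illustration. It only shows how adaptive bin widths yield bin centers and how a per-instance probability distribution over bins is combined linearly with those centers to give a depth value.

```python
import numpy as np

def bin_centers_from_widths(width_logits, d_min=1.0, d_max=80.0):
    """Turn unnormalized scene-level width scores into depth-bin centers.

    Hypothetical helper: the softmax normalization and the default depth
    range are illustrative assumptions, not taken from the paper.
    """
    # Softmax so the widths are positive and sum to 1.
    w = np.exp(width_logits - np.max(width_logits))
    w = w / w.sum()
    # Scale the normalized widths to the depth range and accumulate into edges.
    edges = d_min + (d_max - d_min) * np.concatenate(([0.0], np.cumsum(w)))
    # Each bin center is the midpoint of its two edges.
    return 0.5 * (edges[:-1] + edges[1:])

def instance_depth(bin_logits, centers):
    """Depth as a probability-weighted linear combination of bin centers."""
    p = np.exp(bin_logits - np.max(bin_logits))
    p = p / p.sum()
    return float(np.dot(p, centers))

# Four equal-width bins over [0, 80] and a uniform distribution over them.
centers = bin_centers_from_widths(np.zeros(4), d_min=0.0, d_max=80.0)
depth = instance_depth(np.zeros(4), centers)
```

In this sketch the width scores would come from scene-level features and the bin scores from instance-level features; here both are plain arrays so the arithmetic is easy to follow.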

Revisiting Monocular 3D Object Detection from Scene-Level Depth Retargeting to Instance-Level Spatial Refinement

Leveraging Front and Side Cues for Occlusion Handling in Monocular 3D Object Detection

An Algorithm on Monocular 3D Object Detection Based on Depth Estimation

DID-M3D: Decoupling Instance Depth for Monocular 3D Object Detection

Aug3D-RPN: Improving Monocular 3D Object Detection by Synthetic Images with Virtual Depth

Depth-Enhancement Network for Monocular 3D object detection

MonoSAID: Monocular 3D Object Detection Based on Scene-Level Adaptive Instance Depth Estimation

Attention-Based Depth Distillation with 3D-Aware Positional Encoding for Monocular 3D Object Detection

Depth Is All You Need for Monocular 3D Detection

Diversity Matters: Fully Exploiting Depth Clues for Reliable Monocular 3D Object Detection

MonoNeRD: NeRF-like Representations for Monocular 3D Object Detection

MonoDistill: Learning Spatial Features for Monocular 3D Object Detection

3D Object Aided Self-Supervised Monocular Depth Estimation

Reinforced Axial Refinement Network for Monocular 3D Object Detection

Monocular 3D Object Detection With Sequential Feature Association and Depth Hint Augmentation

Monocular 3D Object Detection with Decoupled Structured Polygon Estimation and Height-Guided Depth Estimation

MonoCD: Monocular 3D Object Detection with Complementary Depths

Boosting Monocular 3D Object Detection with Object-Centric Auxiliary Depth Supervision

Depth Dynamic Center Difference Convolutions for Monocular 3D Object Detection

Dynamic Depth Fusion and Transformation for Monocular 3D Object Detection

Depth Estimation Matters Most: Improving Per-Object Depth Estimation for Monocular 3D Detection and Tracking