Abstract: Monocular 3-D object detection is a challenging and important task in autonomous driving. In 3-D object detection, depth information is crucial for determining the position, size, and pose of objects. However, in the 2-D images obtained by monocular cameras, depth information is compressed into a plane, which blurs the relative position and size relationships between objects. In addition, pose changes of objects in 3-D space (such as rotation and tilt) alter their projected shape, size, and position in 2-D images, which makes detecting 3-D objects from monocular 2-D images more difficult. In this article, we propose a depth-vision-decoupled transformer for monocular multiclass 3-D object detection. The proposed scheme is built from the following novel components: 1) a cascaded group-multiscale convolutional attention (CG-MSCA) module with multiscale-perception and direction-perception capabilities that focuses on the local characteristics of complex scenes; 2) a decoupled-feature-aware transformer (DFTR) module, which globally decouples depth and visual features and encodes and decodes them separately, so that inaccurate monocular depth estimates do not interfere with the model's learning of visual information and the mismatch between stereo information and the geometric shape of the object is alleviated; and 3) a cross-attention-guided fusion module (CAFM) to rationalize the fusion of the decoupled depth and visual features before prediction. Experiments on the KITTI, DAIR-V2X-V, and DAIR-V2X-I datasets show that the proposed method performs competitively with Transformer-based methods and other state-of-the-art (SOTA) methods. For the Car category, our method achieves 17.04% 3-D average precision (AP3D) on the moderate setting (the most important setting) of the KITTI dataset at an intersection over union (IoU) threshold of 0.7.
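To make the IoU threshold of 0.7 concrete, the following is a minimal sketch of how a predicted 3-D box could be matched against a ground-truth box. It is an illustration under simplifying assumptions, not the KITTI evaluation code or the authors' implementation: boxes are treated as axis-aligned (KITTI boxes also carry a yaw angle, which is ignored here), and the names `Box`, `volume`, `iou_3d_axis_aligned`, and `is_true_positive` are hypothetical.

```python
from typing import Tuple

# Hypothetical box layout (an assumption for this sketch):
# (x_min, y_min, z_min, x_max, y_max, z_max), in metres.
Box = Tuple[float, float, float, float, float, float]


def volume(box: Box) -> float:
    """Volume of an axis-aligned box; degenerate boxes give 0."""
    dx = max(0.0, box[3] - box[0])
    dy = max(0.0, box[4] - box[1])
    dz = max(0.0, box[5] - box[2])
    return dx * dy * dz


def iou_3d_axis_aligned(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned 3-D boxes (rotation ignored)."""
    inter = 1.0
    for axis in range(3):
        lo = max(a[axis], b[axis])          # start of the overlap on this axis
        hi = min(a[axis + 3], b[axis + 3])  # end of the overlap on this axis
        if hi <= lo:
            return 0.0                      # no overlap on this axis
        inter *= hi - lo
    union = volume(a) + volume(b) - inter
    return inter / union if union > 0.0 else 0.0


def is_true_positive(pred: Box, gt: Box, threshold: float = 0.7) -> bool:
    """A prediction counts as a match when its IoU with the ground truth reaches the threshold."""
    return iou_3d_axis_aligned(pred, gt) >= threshold


# A car-sized ground-truth box, 4 m long, 2 m wide, 1.5 m high.
gt_box: Box = (0.0, 0.0, 0.0, 4.0, 2.0, 1.5)
# Same-sized predictions displaced along the box length by 0.2 m and by 1.0 m.
near_pred: Box = (0.2, 0.0, 0.0, 4.2, 2.0, 1.5)
far_pred: Box = (1.0, 0.0, 0.0, 5.0, 2.0, 1.5)

near_iou = iou_3d_axis_aligned(near_pred, gt_box)
far_iou = iou_3d_axis_aligned(far_pred, gt_box)
```

In this toy setup, the prediction displaced by 1.0 m has an IoU of 0.6 and is rejected at the 0.7 threshold, while the one displaced by 0.2 m is accepted. This suggests why a modest localization error, such as one caused by inaccurate depth, can turn an otherwise correctly sized detection into a miss.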

Dynamic Depth Fusion and Transformation for Monocular 3D Object Detection

Leveraging Front and Side Cues for Occlusion Handling in Monocular 3D Object Detection

Depth-Enhancement Network for Monocular 3D object detection

Depth Estimation Matters Most: Improving Per-Object Depth Estimation for Monocular 3D Detection and Tracking

Depth Is All You Need for Monocular 3D Detection

Aug3D-RPN: Improving Monocular 3D Object Detection by Synthetic Images with Virtual Depth

MonoAux: Fully Exploiting Auxiliary Information and Uncertainty for Monocular 3D Object Detection

Diversity Matters: Fully Exploiting Depth Clues for Reliable Monocular 3D Object Detection

Depth-Vision-Decoupled Transformer With Cascaded Group Convolutional Attention for Monocular 3-D Object Detection

Fine-Grained Multilevel Fusion for Anti-Occlusion Monocular 3D Object Detection

DID-M3D: Decoupling Instance Depth for Monocular 3D Object Detection

Multi-Modal Fusion Based on Depth Adaptive Mechanism for 3D Object Detection

3D Object Aided Self-Supervised Monocular Depth Estimation

Depth-Guided Vision Transformer With Normalizing Flows for Monocular 3D Object Detection

Depth-conditioned Dynamic Message Propagation for Monocular 3D Object Detection

Adaptive Semantic Fusion Framework for Unsupervised Monocular Depth Estimation

Monocular 3D Object Detection with Decoupled Structured Polygon Estimation and Height-Guided Depth Estimation

Kinematic 3D Object Detection in Monocular Video

Accurate Monocular Object Detection via Color-Embedded 3D Reconstruction for Autonomous Driving

Temporal Feature Fusion for 3D Detection in Monocular Video

DyFusion: Cross-Attention 3D Object Detection with Dynamic Fusion