Abstract: Monocular 3-D object detection is a challenging and important task in autonomous driving. In 3-D object detection, depth information is crucial for determining the position, size, and pose of objects. However, in the 2-D images captured by a monocular camera, depth information is compressed into a plane, which blurs the relative position and size relationships between objects. In addition, pose changes of objects in 3-D space (such as rotation and tilt) alter their projected shape, size, and position in the 2-D image, which makes detecting 3-D objects from monocular 2-D images even harder. In this article, we propose a depth-vision-decoupled transformer for monocular multiclass 3-D object detection. The proposed scheme is built from the following novel components: 1) a cascaded group-multiscale convolutional attention (CG-MSCA) module with multiscale-perception and direction-perception capabilities that focuses on the local characteristics of complex scenes; 2) a decoupled-feature-aware transformer (DFTR) module, which globally decouples depth and visual features and encodes and decodes them separately, so that inaccurate monocular depth estimates do not interfere with the model's learning of visual information, and which alleviates the mismatch between the stereo information and the geometric shape of the object; and 3) a cross-attention-guided fusion module (CAFM) that fuses the decoupled depth and visual features before prediction. Experiments on the KITTI, DAIR-V2X-V, and DAIR-V2X-I datasets show that the proposed method achieves competitive performance compared with transformer-based methods and other state-of-the-art (SOTA) methods. For the Car category, our method achieves a 3-D average precision ($AP_{3D}$) of 17.04% on the moderate setting (the most important setting) of the KITTI dataset at an intersection over union (IoU) threshold of 0.7.
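The depth ambiguity described in the abstract can be illustrated with a standard pinhole camera model. The sketch below is not part of the proposed method; the function names, the focal length, and the principal point are hypothetical values chosen for illustration. It shows that an object twice as far away and twice as large projects to the same pixel location and the same pixel width, which is why depth cannot be read directly from a single image.

```python
# Illustrative pinhole-camera sketch (assumed intrinsics, not from the paper).

def project(point, focal=1000.0, cx=640.0, cy=360.0):
    """Project a camera-frame 3-D point (x, y, z), in meters, to pixel (u, v)."""
    x, y, z = point
    if z <= 0:
        raise ValueError("point must be in front of the camera (z > 0)")
    return (focal * x / z + cx, focal * y / z + cy)


def projected_width(width_m, depth_m, focal=1000.0):
    """Approximate pixel width of a fronto-parallel object of given metric width."""
    if depth_m <= 0:
        raise ValueError("depth must be positive")
    return focal * width_m / depth_m


# Two hypothetical object centers on the same viewing ray: the second is
# twice as far from the camera as the first.
near_center = (1.0, 0.5, 10.0)
far_center = (2.0, 1.0, 20.0)

near_pixel = project(near_center)
far_pixel = project(far_center)

# A 1.8 m wide object at 10 m and a 3.6 m wide object at 20 m.
near_width_px = projected_width(1.8, 10.0)
far_width_px = projected_width(3.6, 20.0)

print(near_pixel, far_pixel)        # both centers land on the same pixel
print(near_width_px, far_width_px)  # both objects span the same pixel width
```

Because both configurations yield identical image evidence, a monocular detector has to infer depth from learned cues such as object size priors and scene context rather than from geometry alone.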

MonoPSTR: Monocular 3-D Object Detection with Dynamic Position and Scale-Aware Transformer

MonoDETR: Depth-guided Transformer for Monocular 3D Object Detection

MonoDTR: Monocular 3D Object Detection with Depth-Aware Transformer

MonoDGP: Monocular 3D Object Detection with Decoupled-Query and Geometry-Error Priors

S$^3$-MonoDETR: Supervised Shape&Scale-perceptive Deformable Transformer for Monocular 3D Object Detection

MT-SSD: Single-Stage 3D Object Detector Based on Magnification Transformation

SSD-MonoDETR: Supervised Scale-aware Deformable Transformer for Monocular 3D Object Detection

MonoATT: Online Monocular 3D Object Detection with Adaptive Token Transformer

MonoDETRNext: Next-Generation Accurate and Efficient Monocular 3D Object Detector

MonoMM: A Multi-scale Mamba-Enhanced Network for Real-time Monocular 3D Object Detection

Depth-Vision-Decoupled Transformer With Cascaded Group Convolutional Attention for Monocular 3-D Object Detection

Monocular 3D Object Detection Leveraging Accurate Proposals and Shape Reconstruction

PVT-SSD: Single-Stage 3D Object Detector with Point-Voxel Transformer

MonoGRNet: A General Framework for Monocular 3D Object Detection

MonoAux: Fully Exploiting Auxiliary Information and Uncertainty for Monocular 3D Object Detection

Pseudo-Mono for Monocular 3D Object Detection in Autonomous Driving

MonoPGC: Monocular 3D Object Detection with Pixel Geometry Contexts

Spatial-Temporal Graph Enhanced DETR Towards Multi-Frame 3D Object Detection

MonoTDP: Twin Depth Perception for Monocular 3D Object Detection in Adverse Scenes

Depth-Guided Vision Transformer With Normalizing Flows for Monocular 3D Object Detection

3DPPE: 3D Point Positional Encoding for Multi-Camera 3D Object Detection Transformers