Abstract: Monocular 3-D object detection is a challenging and important task in autonomous driving. In 3-D object detection, depth information is crucial for determining the position, size, and pose of objects. However, in the 2-D images obtained by monocular cameras, depth information is compressed into a plane, which blurs the relative position and size relationships between objects. In addition, pose changes of objects in 3-D space (such as rotation and tilt) alter their projected shape, size, and position in 2-D images, which makes detecting 3-D objects from monocular 2-D images more difficult. In this article, we propose a depth-vision-decoupled transformer for monocular multiclass 3-D object detection. The proposed scheme is built from the following novel components: 1) a cascaded group-multiscale convolutional attention (CG-MSCA) with multiscale-perception and direction-perception capabilities that focuses on the local characteristics of complex scenes; 2) a decoupled-feature-aware transformer (DFTR) module, which globally decouples depth and visual features and encodes and decodes them separately, so that inaccurate monocular depth estimates do not interfere with the model's learning of visual information, and which alleviates the mismatch between the stereo information and the geometric shape of the object; and 3) a cross-attention-guided fusion module (CAFM) that rationalizes the fusion of the decoupled depth and visual features before prediction. Experiments on the KITTI, DAIR-V2X-V, and DAIR-V2X-I datasets show that the proposed method performs competitively with Transformer-based methods and other state-of-the-art (SOTA) methods. For the Car category, our method achieves a 3-D average precision (AP3D) of 17.04% on the moderate setting (the most important setting) of the KITTI dataset with an intersection over union (IoU) threshold of 0.7.
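To make the IoU threshold of 0.7 mentioned above concrete, the following is a minimal sketch of how a predicted 3-D box might be matched against a ground-truth box. It is a simplification under stated assumptions: boxes are treated as axis-aligned, whereas KITTI boxes also carry a yaw angle, so this is not the official evaluation code. The names `Box3D`, `iou_3d_axis_aligned`, and `is_true_positive` are hypothetical and do not come from the article.

```python
from dataclasses import dataclass


@dataclass
class Box3D:
    """Hypothetical axis-aligned 3-D box: centre (x, y, z) and size (dx, dy, dz)."""
    x: float
    y: float
    z: float
    dx: float
    dy: float
    dz: float

    def volume(self) -> float:
        return self.dx * self.dy * self.dz


def _overlap_1d(c1: float, s1: float, c2: float, s2: float) -> float:
    """Length of the overlap between two intervals given by centre and size."""
    lo = max(c1 - s1 / 2.0, c2 - s2 / 2.0)
    hi = min(c1 + s1 / 2.0, c2 + s2 / 2.0)
    return max(0.0, hi - lo)


def iou_3d_axis_aligned(a: Box3D, b: Box3D) -> float:
    """Intersection over union of two axis-aligned boxes (rotation is ignored)."""
    inter = (
        _overlap_1d(a.x, a.dx, b.x, b.dx)
        * _overlap_1d(a.y, a.dy, b.y, b.dy)
        * _overlap_1d(a.z, a.dz, b.z, b.dz)
    )
    union = a.volume() + b.volume() - inter
    return inter / union if union > 0.0 else 0.0


def is_true_positive(pred: Box3D, gt: Box3D, threshold: float = 0.7) -> bool:
    """A prediction counts as a match when its IoU reaches the threshold."""
    return iou_3d_axis_aligned(pred, gt) >= threshold


if __name__ == "__main__":
    gt = Box3D(0.0, 0.0, 0.0, 2.0, 2.0, 2.0)
    near = Box3D(0.2, 0.0, 0.0, 2.0, 2.0, 2.0)  # small offset along x
    far = Box3D(1.0, 0.0, 0.0, 2.0, 2.0, 2.0)   # half-box offset along x
    print(is_true_positive(near, gt), is_true_positive(far, gt))
```

In this sketch a half-box offset along a single axis already drops the IoU to one third, which illustrates how strict the 0.7 threshold is for 3-D localization.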

Monocular 3D Detection for Autonomous Vehicles by Cascaded Geometric Constraints and Depurated Using 3D Results

3D Bounding Box Estimation for Autonomous Vehicles by Cascaded Geometric Constraints and Depurated 2D Detections Using 3D Results

Leveraging Front and Side Cues for Occlusion Handling in Monocular 3D Object Detection

Monocular 3-D Vehicle Detection Using a Cascade Network for Autonomous Driving

A Multi-view 3D Vehicle Detection Method Based On Novel 3D Proposal Generation Method

Monocular 3D object detection via estimation of paired keypoints for autonomous driving

Accurate Monocular Object Detection via Color-Embedded 3D Reconstruction for Autonomous Driving

Monocular 3D object detection using dual quadric for autonomous driving

Monocular 3D Object Detection: An Extrinsic Parameter Free Approach

3D Detection for Occluded Vehicles From Point Clouds

Image Guidance Based 3D Vehicle Detection in Traffic Scene

Monocular 3D Object Detection with Decoupled Structured Polygon Estimation and Height-Guided Depth Estimation

Joint Monocular 3D Vehicle Detection and Tracking

Learning 2D to 3D Lifting for Object Detection in 3D for Autonomous Vehicles

DETR3D: 3D Object Detection from Multi-view Images via 3D-to-2D Queries

RTM3D: Real-Time Monocular 3D Detection from Object Keypoints for Autonomous Driving

Ground-aware Monocular 3D Object Detection for Autonomous Driving

SGM3D: Stereo Guided Monocular 3D Object Detection

Depth-Vision-Decoupled Transformer With Cascaded Group Convolutional Attention for Monocular 3-D Object Detection

Roadside Monocular 3D Detection via 2D Detection Prompting

6DoF-3D: Efficient and accurate 3D object detection using six degrees-of-freedom for autonomous driving