Abstract: In this work, we propose KM3D-Net, a novel one-stage, keypoint-based framework for monocular 3D object detection that uses only RGB images. 2D detection is a semantics-aware task, so a deep neural network alone suffices to predict the 2D properties of objects. For image-based 3D detection, we argue that a deep neural network must be combined with geometric constraints to jointly estimate appearance-related and spatial information. We therefore design a fully convolutional model that predicts object keypoints, dimensions, and orientation, and combine these predictions with perspective geometry constraints to compute object position. Further, we reformulate the geometric constraints in a differentiable form and embed them in the network, which reduces running time while keeping the model outputs consistent in an end-to-end fashion. Benefiting from this simple structure, we propose an effective semi-supervised training strategy for settings where labeled training data are scarce. In this strategy, we enforce consensus between the predictions of two weight-sharing KM3D-Nets for the same unlabeled image under different input augmentations and network regularization. In particular, we unify the coordinate-dependent augmentations as a single affine transformation, so that object positions can still be recovered differentiably, and propose a keypoint-dropout module for network regularization. Our model requires only RGB images, without synthetic data, instance segmentation, CAD models, or a depth generator. Extensive experiments on the popular KITTI 3D detection dataset indicate that KM3D-Net surpasses state-of-the-art methods by a large margin in both efficiency and accuracy. To the best of our knowledge, this is also the first application of semi-supervised learning to monocular 3D object detection. With only 13% of the KITTI data labeled, we surpass most previous fully supervised methods.
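To make the role of the perspective geometry constraints concrete, the following is a minimal sketch of how an object position could be recovered from predicted keypoints, dimensions, and orientation by linear least squares. It is an illustration under stated assumptions, not the authors' implementation: the helper names (`box_corners`, `project`, `solve_position`), the choice of nine keypoints (eight box corners plus the center), the corner ordering, and the camera intrinsics are all hypothetical. Each keypoint contributes two equations that are linear in the unknown translation, so the solution is a closed-form pseudo-inverse, which is the kind of operation that can be made differentiable.

```python
import numpy as np

def box_corners(dims, yaw):
    """Eight corners plus center (9x3) of a box centered at the origin,
    rotated by `yaw` about the camera Y axis. Ordering is an assumption."""
    h, w, l = dims
    x = np.array([l, l, -l, -l, l, l, -l, -l, 0.0]) / 2
    y = np.array([h, h, h, h, -h, -h, -h, -h, 0.0]) / 2
    z = np.array([w, -w, -w, w, w, -w, -w, w, 0.0]) / 2
    c, s = np.cos(yaw), np.sin(yaw)
    R = np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])
    return (R @ np.stack([x, y, z])).T

def project(points_cam, K):
    """Pinhole projection of Nx3 camera-frame points to Nx2 pixels."""
    uvw = (K @ points_cam.T).T
    return uvw[:, :2] / uvw[:, 2:3]

def solve_position(kps, dims, yaw, K):
    """Least-squares translation T such that project(corners + T) ~= kps.

    From u * (Z + Tz) = fx * (X + Tx) + cx * (Z + Tz) we get
    fx * Tx + (cx - u) * Tz = (u - cx) * Z - fx * X, and likewise for v.
    """
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    P = box_corners(dims, yaw)
    u, v = kps[:, 0], kps[:, 1]
    A = np.zeros((2 * len(P), 3))
    b = np.zeros(2 * len(P))
    A[0::2, 0] = fx
    A[0::2, 2] = cx - u
    b[0::2] = (u - cx) * P[:, 2] - fx * P[:, 0]
    A[1::2, 1] = fy
    A[1::2, 2] = cy - v
    b[1::2] = (v - cy) * P[:, 2] - fy * P[:, 1]
    T, *_ = np.linalg.lstsq(A, b, rcond=None)
    return T

if __name__ == "__main__":
    # Illustrative intrinsics and object parameters (assumed values).
    K = np.array([[721.5, 0.0, 609.6], [0.0, 721.5, 172.9], [0.0, 0.0, 1.0]])
    dims, yaw = (1.5, 1.6, 3.9), 0.3          # (h, w, l) in meters, radians
    T_true = np.array([1.5, 1.2, 20.0])
    kps = project(box_corners(dims, yaw) + T_true, K)
    print(solve_position(kps, dims, yaw, K))  # close to T_true for noise-free keypoints
```

With noise-free keypoints the system is consistent and the true translation is recovered; with noisy network predictions the same solve returns the least-squares fit over all 18 equations.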

Monocular 3D Detection With Geometric Constraint Embedding and Semi-Supervised Training
