Abstract: Monocular 3D object detection plays a crucial role in the field of self-driving cars, estimating the size and location of objects solely from input images. However, a notable disparity exists between the training and inference of 3D object detectors: during inference, monocular 3D detectors depend solely on images captured by cameras, while during training these methods require 3D ground truths labeled on point cloud data, which is obtained using specialized devices like LiDAR. This discrepancy creates a break in the data loop, preventing the feedback data from production cars from being used to enhance the robustness of the detectors. To address this issue and close the data loop, we present a weakly supervised solution that trains monocular 3D object detectors using only 2D labels, eliminating the requirement for 3D ground truths. Our approach considers two kinds of view consistency, spatial and temporal, which play a crucial role in regularizing the prediction of 3D bounding boxes. Spatial view consistency is achieved by employing projection and multi-view consistency techniques to guide the optimization of the target's location and size. We leverage temporal view consistency to provide temporal multi-view image pairs, and we further introduce temporal movement consistency to tackle the challenge of dynamic scenes. With only 2D ground truths, our method achieves performance comparable to fully supervised methods. Additionally, our method can be employed as a pre-training method and achieves significant improvement when fine-tuned with a small proportion of fully supervised labels.
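
As a rough illustration of the projection consistency idea mentioned in the abstract, the following minimal sketch projects a predicted 3D box into the image and compares its enclosing rectangle with a 2D label. It is not the authors' loss: the function names, the pinhole camera model, the box parameterization (center, height/width/length, yaw about the vertical axis), and the L1 penalty are all assumptions made for this example.

```python
import math

def box_corners(center, dims, yaw):
    """Eight corners of a 3D box in camera coordinates
    (assumed convention: x right, y down, z forward)."""
    cx, cy, cz = center
    h, w, l = dims
    corners = []
    for dx in (-l / 2, l / 2):
        for dy in (-h / 2, h / 2):
            for dz in (-w / 2, w / 2):
                # Rotate the horizontal offsets about the vertical (y) axis.
                x = cx + dx * math.cos(yaw) + dz * math.sin(yaw)
                z = cz - dx * math.sin(yaw) + dz * math.cos(yaw)
                corners.append((x, cy + dy, z))
    return corners

def project_to_2d_box(corners, fx, fy, u0, v0):
    """Pinhole projection of the corners, then the enclosing 2D rectangle.
    Assumes every corner lies in front of the camera (z > 0)."""
    us = [fx * x / z + u0 for x, _, z in corners]
    vs = [fy * y / z + v0 for _, y, z in corners]
    return (min(us), min(vs), max(us), max(vs))

def projection_loss(pred_box3d, label_box2d, intrinsics):
    """Mean absolute difference between the projected box and the 2D label.
    The L1 form is a hypothetical choice for this sketch."""
    center, dims, yaw = pred_box3d
    projected = project_to_2d_box(box_corners(center, dims, yaw), *intrinsics)
    return sum(abs(p - g) for p, g in zip(projected, label_box2d)) / 4
```

A loss of this kind constrains the 3D prediction using only a 2D annotation, but a single view leaves depth and size ambiguous. That ambiguity is why the abstract pairs projection consistency with multi-view and temporal consistency.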

Weakly Supervised Monocular 3D Object Detection by Spatial-Temporal View Consistency
