VSRD: Instance-Aware Volumetric Silhouette Rendering for Weakly Supervised 3D Object Detection

Zihua Liu, Hiroki Sakuma, Masatoshi Okutomi

2024-03-30

Abstract: Monocular 3D object detection poses a significant challenge in 3D scene understanding due to its inherently ill-posed nature in monocular depth estimation. Existing methods heavily rely on supervised learning using abundant 3D labels, typically obtained through expensive and labor-intensive annotation on LiDAR point clouds. To tackle this problem, we propose a novel weakly supervised 3D object detection framework named VSRD (Volumetric Silhouette Rendering for Detection) to train 3D object detectors without any 3D supervision but only weak 2D supervision. VSRD consists of multi-view 3D auto-labeling and subsequent training of monocular 3D object detectors using the pseudo labels generated in the auto-labeling stage. In the auto-labeling stage, we represent the surface of each instance as a signed distance field (SDF) and render its silhouette as an instance mask through our proposed instance-aware volumetric silhouette rendering. To directly optimize the 3D bounding boxes through rendering, we decompose the SDF of each instance into the SDF of a cuboid and the residual distance field (RDF) that represents the residual from the cuboid. This mechanism enables us to optimize the 3D bounding boxes in an end-to-end manner by comparing the rendered instance masks with the ground truth instance masks. The optimized 3D bounding boxes serve as effective training data for 3D object detection. We conduct extensive experiments on the KITTI-360 dataset, demonstrating that our method outperforms the existing weakly supervised 3D object detection methods. The code is available at
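The SDF decomposition mentioned in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `cuboid_sdf`, `instance_sdf`, and `zero_residual` are hypothetical, the box is assumed to be axis-aligned and centered at the origin (the box pose is omitted), and the residual is a placeholder for whatever learned function models the RDF.

```python
import numpy as np

def cuboid_sdf(points, half_extents):
    """Signed distance from points (N, 3) to an axis-aligned box centered at the origin.

    Negative inside the box, zero on its surface, positive outside.
    """
    q = np.abs(points) - half_extents
    # Distance to the box for points outside it (zero for points inside).
    outside = np.linalg.norm(np.maximum(q, 0.0), axis=-1)
    # Negative distance to the nearest face for points inside (zero outside).
    inside = np.minimum(np.max(q, axis=-1), 0.0)
    return outside + inside

def zero_residual(points):
    """Hypothetical stand-in for a learned residual distance field (RDF)."""
    return np.zeros(points.shape[0])

def instance_sdf(points, half_extents, residual_fn=zero_residual):
    """Instance SDF = cuboid SDF + residual, following the decomposition in the abstract."""
    return cuboid_sdf(points, half_extents) + residual_fn(points)

half_extents = np.array([1.0, 2.0, 3.0])
points = np.array([[0.0, 0.0, 0.0],   # box center
                   [2.0, 0.0, 0.0],   # one unit outside the +x face
                   [2.0, 3.0, 0.0]])  # outside an edge
distances = instance_sdf(points, half_extents)
```

Because the box dimensions enter the instance SDF directly, a loss defined on the rendered surface can propagate gradients to the box parameters, which is the property the abstract relies on.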

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

### Problems the Paper Attempts to Solve

The paper addresses the cost of 3D supervision in monocular 3D object detection, a task that is inherently ill-posed because depth must be estimated from a single image. Existing methods rely heavily on supervised learning with abundant 3D labels, which are typically obtained through expensive, labor-intensive annotation of LiDAR point clouds. This annotation cost is a significant barrier to deploying 3D object detectors in autonomous driving systems.

To tackle this problem, the authors propose a weakly supervised 3D object detection framework called VSRD (Volumetric Silhouette Rendering for Detection), which trains 3D object detectors without any 3D supervision, using only weak 2D supervision. VSRD has two stages: multi-view 3D auto-labeling, followed by training of monocular 3D object detectors on the pseudo labels produced by the auto-labeling stage. In the auto-labeling stage, each instance surface is represented as a signed distance field (SDF), decomposed into the SDF of a cuboid plus a residual distance field, and its silhouette is rendered as an instance mask through instance-aware volumetric silhouette rendering. Comparing the rendered masks with the ground truth instance masks lets the 3D bounding boxes be optimized end to end, and the optimized boxes then serve as training data for 3D object detection. Experiments on the KITTI-360 dataset show that the method outperforms existing weakly supervised 3D object detection methods.
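The idea of rendering a silhouette from an SDF and comparing it with a 2D mask can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's rendering formulation: per-sample opacity is taken to be a logistic sigmoid of the negated SDF (with a hypothetical `temperature` parameter), samples are alpha-composited along a single ray, only one instance is rendered, and the mask loss is assumed to be binary cross-entropy. All function names are hypothetical.

```python
import numpy as np

def box_sdf(points, half_extents=np.array([1.0, 1.0, 2.0])):
    """Signed distance to an axis-aligned box centered at the origin."""
    q = np.abs(points) - half_extents
    return (np.linalg.norm(np.maximum(q, 0.0), axis=-1)
            + np.minimum(np.max(q, axis=-1), 0.0))

def render_silhouette(sdf_fn, origin, direction,
                      near=0.0, far=8.0, num_samples=64, temperature=0.05):
    """Soft silhouette value in [0, 1] for one ray.

    Assumption: opacity at each sample is sigmoid(-sdf / temperature), so
    samples inside the surface are nearly opaque and samples outside are
    nearly transparent.
    """
    t = np.linspace(near, far, num_samples)
    points = origin + t[:, None] * direction
    alpha = 1.0 / (1.0 + np.exp(sdf_fn(points) / temperature))
    # Probability that the ray is blocked by at least one sample.
    return 1.0 - np.prod(1.0 - alpha)

def mask_loss(silhouette, mask_value, eps=1e-6):
    """Binary cross-entropy between a rendered silhouette and a mask pixel."""
    s = np.clip(silhouette, eps, 1.0 - eps)
    return -(mask_value * np.log(s) + (1.0 - mask_value) * np.log(1.0 - s))

direction = np.array([0.0, 0.0, 1.0])
hit = render_silhouette(box_sdf, np.array([0.0, 0.0, -4.0]), direction)   # ray through the box
miss = render_silhouette(box_sdf, np.array([3.0, 0.0, -4.0]), direction)  # ray beside the box
```

A ray through the box yields a silhouette value close to 1 and a ray that misses it yields a value close to 0. Since every step is differentiable with respect to the SDF, a framework with automatic differentiation could use such a loss to adjust the box parameters.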

3D-SSD: Learning Hierarchical Features from RGB-D Images for Amodal 3D Object Detection

Leveraging Front and Side Cues for Occlusion Handling in Monocular 3D Object Detection

Spatial Likelihood Voting with Self-Knowledge Distillation for Weakly Supervised Object Detection

Weakly Supervised 3D Object Detection via Multi-Level Visual Guidance

Monocular Differentiable Rendering for Self-Supervised 3D Object Detection

Weakly Supervised 3D Object Detection from Point Clouds

Back to Reality: Learning Data-Efficient 3D Object Detector with Shape Guidance

WeakM3D: Towards Weakly Supervised Monocular 3D Object Detection

SVDM: Single-View Diffusion Model for Pseudo-Stereo 3D Object Detection

Training an Open-Vocabulary Monocular 3D Object Detection Model without 3D Data

DID-M3D: Decoupling Instance Depth for Monocular 3D Object Detection

View-to-Label: Multi-View Consistency for Self-Supervised 3D Object Detection

Weakly Supervised Monocular 3D Detection with a Single-View Image

General Geometry-aware Weakly Supervised 3D Object Detection

DETR3D: 3D Object Detection from Multi-view Images via 3D-to-2D Queries

Towards A Weakly Supervised Framework for 3D Point Cloud Object Detection and Annotation

Suppress-and-Refine Framework for End-to-End 3D Object Detection

ODM3D: Alleviating Foreground Sparsity for Semi-Supervised Monocular 3D Object Detection

Object as Query: Lifting any 2D Object Detector to 3D Detection