Towards 3D Object Detection with 2D Supervision

Jinrong Yang, Tiancai Wang, Zheng Ge, Weixin Mao, Xiaoping Li, Xiangyu Zhang
DOI: https://doi.org/10.48550/arXiv.2211.08287
2022-11-16
Abstract: The great progress of 3D object detectors relies on large-scale data and 3D annotations. The annotation cost for 3D bounding boxes is extremely expensive while the 2D ones are easier and cheaper to collect. In this paper, we introduce a hybrid training framework, enabling us to learn a visual 3D object detector with massive 2D (pseudo) labels, even without 3D annotations. To break through the information bottleneck of 2D clues, we explore a new perspective: Temporal 2D Supervision. We propose a temporal 2D transformation to bridge the 3D predictions with temporal 2D labels. Two steps, including homography warping and 2D box deduction, are taken to transform the 3D predictions into 2D ones for supervision. Experiments conducted on the nuScenes dataset show strong results (nearly 90% of its fully-supervised performance) with only 25% 3D annotations. We hope our findings can provide new insights for using a large number of 2D annotations for 3D perception.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper aims to reduce the dependence of 3D object detection on expensive 3D annotations. Specifically, the authors propose a hybrid training framework that uses a large amount of 2D annotation data (2D bounding boxes and class labels) to train camera-based 3D detectors, even when no 3D annotations are available. The core of the method is Temporal 2D Supervision: predicted 3D bounding boxes are converted into 2D boxes in temporally adjacent frames so that they can be supervised with 2D annotations. This significantly reduces annotation cost and also offers a way to train 3D detectors without LiDAR point cloud data.

### Main Contributions

1. **Hybrid Training Framework**: A hybrid training framework that trains visual 3D detectors with a large amount of 2D annotation data and a small amount of 3D annotation data, or with no 3D annotation data at all.
2. **Temporal 2D Supervision**: Through Homography Warping and 2D Box Deduction, 3D predictions are converted into 2D boxes that can be supervised with 2D annotations.
3. **Handling Moving Objects**: Moving objects cause an offset between the transformed predictions and the temporal 2D labels. Symmetric Temporal 2D Supervision and an appropriate time-interval selection strategy are proposed to reduce this supervision deviation.
4. **Experimental Verification**: Extensive experiments on the nuScenes dataset show that, with only 25% of the 3D annotations, the method reaches nearly 90% of fully supervised performance.

### Method Overview

1. **Homography Warping**: Given the 3D bounding box predicted for the current frame, a homography matrix transforms it into the camera coordinate system of an adjacent frame.
2. **2D Box Deduction**: The transformed 3D bounding box is projected onto the image plane, its 8 corner points are extracted, and the minimum bounding rectangle of these corners is taken as the 2D bounding box.
3. **Supervision Loss**: A hybrid loss function combines 3D supervision and 2D supervision. Adjusting the weights of the two terms controls the proportion of each kind of supervision during training.

### Experimental Results

- **Performance**: On the nuScenes dataset, with only 25% of the 3D annotations, the method reaches nearly 90% of fully supervised performance.
- **Robustness**: The effectiveness of the method is verified by progressively removing 3D supervision. For moving objects in particular, symmetric Temporal 2D Supervision and the time-interval selection strategy significantly improve the robustness of the model.

### Conclusion

The paper provides a cost-effective approach to 3D object detection. By using a large amount of cheap 2D annotation data, it significantly reduces annotation cost while maintaining high detection performance. This is particularly relevant in practice for building large-scale datasets and training models on them.
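
The two transformation steps in the Method Overview (moving a predicted 3D box into an adjacent frame, then deducing a 2D box from its 8 projected corners) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a pinhole camera without skew, a box parameterized by center, size, and yaw about the camera's y axis, and it substitutes a generic 4x4 homogeneous transform `T` for the paper's homography. The names `box_corners`, `to_adjacent_frame`, and `deduce_2d_box` are hypothetical.

```python
import math
from itertools import product


def box_corners(center, size, yaw):
    """Eight corners of a 3D box in camera coordinates.

    Assumption: yaw rotates the box about the camera's y axis.
    """
    cx, cy, cz = center
    dx, dy, dz = size
    cos_y, sin_y = math.cos(yaw), math.sin(yaw)
    corners = []
    for sx, sy, sz in product((-0.5, 0.5), repeat=3):
        x, y, z = sx * dx, sy * dy, sz * dz
        corners.append((cx + cos_y * x + sin_y * z,
                        cy + y,
                        cz - sin_y * x + cos_y * z))
    return corners


def to_adjacent_frame(T, points):
    """Apply a 4x4 homogeneous transform (row-major nested lists) to 3D points.

    Assumption: T maps current-frame camera coordinates to the adjacent
    frame's camera coordinates; it stands in for the paper's homography.
    """
    out = []
    for x, y, z in points:
        p = (x, y, z, 1.0)
        out.append(tuple(sum(T[r][c] * p[c] for c in range(4))
                         for r in range(3)))
    return out


def deduce_2d_box(K, points, min_depth=1e-6):
    """Project points with intrinsics K; return (u_min, v_min, u_max, v_max).

    Returns None if any point is not in front of the camera, which is a
    simplification chosen for this sketch.
    """
    if any(z <= min_depth for _, _, z in points):
        return None
    us = [K[0][0] * x / z + K[0][2] for x, _, z in points]
    vs = [K[1][1] * y / z + K[1][2] for _, y, z in points]
    return (min(us), min(vs), max(us), max(vs))


# Illustrative values only: a 2 m cube 10 m ahead, identity frame transform.
K = [[100.0, 0.0, 50.0], [0.0, 100.0, 50.0], [0.0, 0.0, 1.0]]
T = [[1.0 if r == c else 0.0 for c in range(4)] for r in range(4)]
corners = box_corners((0.0, 0.0, 10.0), (2.0, 2.0, 2.0), 0.0)
box_2d = deduce_2d_box(K, to_adjacent_frame(T, corners))
```

The resulting `box_2d` is what would be compared against a 2D label from the adjacent frame. Taking the min and max over all 8 projected corners gives the tightest axis-aligned rectangle that contains the projected box.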