Jianhao Li, Tianyu Sun, Zhongdao Wang, Enze Xie, Bailan Feng, Hongbo Zhang, Ze Yuan, Ke Xu, Jiaheng Liu, Ping Luo
Abstract: This paper proposes an algorithm for automatically labeling 3D objects from 2D point or box prompts, with a focus on applications in autonomous driving. Unlike prior work, our auto-labeler predicts 3D shapes instead of bounding boxes and does not require training on a specific dataset. We propose a Segment, Lift, and Fit (SLF) paradigm to achieve this goal. First, we segment high-quality instance masks from the prompts using the Segment Anything Model (SAM) and transform the remaining problem into predicting 3D shapes from given 2D masks. This problem is ill-posed and therefore challenging, because multiple 3D shapes can project into an identical mask. To tackle this issue, we then lift 2D masks to 3D forms and employ gradient descent to adjust their poses and shapes until the projections fit the masks and the surfaces conform to surrounding LiDAR points. Notably, since we do not train on a specific dataset, the SLF auto-labeler does not overfit to biased annotation patterns in the training set as other methods do. Thus, the generalization ability across different datasets improves. Experimental results on the KITTI dataset demonstrate that the SLF auto-labeler produces high-quality bounding box annotations, achieving an AP@0.5 IoU of nearly 90%. Detectors trained with the generated pseudo-labels perform nearly as well as those trained with actual ground-truth annotations. Furthermore, the SLF auto-labeler shows promising results in detailed shape predictions, providing a potential alternative for the occupancy annotation of dynamic objects.
What problem does this paper attempt to address?
### Problems the Paper Attempts to Solve
This paper addresses the automation of 3D object annotation for autonomous driving. Specifically, the authors propose a method that automatically generates 3D object shape labels from 2D prompts (points or boxes). Unlike existing auto-labelers, it predicts detailed 3D shapes rather than only 3D bounding boxes, and it does not require training on a specific dataset. Because it is not trained on any one dataset, it does not inherit that dataset's annotation biases and generalizes better across datasets.
### Background and Challenges
1. **Need for 3D Annotation**: Modern robotics and autonomous driving systems require a large amount of annotated data to understand 3D scenes, especially the annotation of dynamic objects such as vehicles and pedestrians.
2. **Difficulty of Manual Annotation**: Manually annotating a large number of 3D bounding boxes is a tedious and costly task, limiting the scalability of 3D object detectors.
3. **Need for Fine-Grained Annotation**: With the development of 3D perception models, there is an increasing demand for finer annotation granularity, such as voxel occupancy, but these fine-grained annotations further complicate the annotation process, reducing efficiency.
### Proposed Method
The authors propose a new method called Segment, Lift, and Fit (SLF) for automatic 3D annotation. The specific steps are as follows:
1. **Segment**: From the input 2D prompts (points or boxes), generate high-quality instance masks with the Segment Anything Model (SAM).
2. **Lift**: Lift each 2D instance mask to a 3D form, representing the object with a signed distance function (SDF).
3. **Fit**: Iteratively optimize the shape and pose of each 3D object by gradient descent until its projection fits the 2D mask and its surface conforms to the surrounding LiDAR points.
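To make the Fit step more concrete, the following is a minimal sketch of fitting an SDF to LiDAR-like points by gradient descent. It is not the authors' implementation: the shape is a plain sphere (center and radius) instead of a learned vehicle shape, the mask-projection term is omitted, and `sphere_sdf`, `fit_sphere`, the learning rate, and the synthetic points are all assumptions made for illustration. The loss is the mean squared SDF value at the observed points, which is zero when every point lies on the surface.

```python
import numpy as np


def sphere_sdf(points, center, radius):
    """Signed distance from each point to the sphere surface (negative inside)."""
    return np.linalg.norm(points - center, axis=1) - radius


def fit_sphere(points, center, radius, lr=0.1, steps=1000):
    """Gradient descent on the mean squared SDF of the observed points.

    Toy stand-in for the Fit step: 'pose' is the sphere center and
    'shape' is the sphere radius (both hypothetical simplifications).
    """
    center = np.asarray(center, dtype=float).copy()
    radius = float(radius)
    for _ in range(steps):
        diff = points - center
        dist = np.linalg.norm(diff, axis=1)
        residual = dist - radius  # SDF value at each point
        # d(residual)/d(center) = -(p - c) / |p - c|, d(residual)/d(radius) = -1
        unit = diff / np.maximum(dist, 1e-9)[:, None]
        grad_center = (2.0 * residual[:, None] * -unit).mean(axis=0)
        grad_radius = (2.0 * residual * -1.0).mean()
        center -= lr * grad_center
        radius -= lr * grad_radius
    return center, radius


# Synthetic, noise-free "LiDAR" points sampled on a known sphere.
rng = np.random.default_rng(0)
directions = rng.normal(size=(200, 3))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)
true_center, true_radius = np.array([1.0, -2.0, 0.5]), 1.5
lidar_points = true_center + true_radius * directions

est_center, est_radius = fit_sphere(lidar_points, center=[0.0, 0.0, 0.0], radius=1.0)
final_loss = float(np.mean(sphere_sdf(lidar_points, est_center, est_radius) ** 2))
```

In this toy setting the estimate starts from a deliberately wrong center and radius and is pulled toward the surface the points lie on. A real pipeline would add a second loss term that compares the rendered silhouette with the SAM mask, which is what constrains the parts of the object that LiDAR does not observe.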
### Main Contributions
1. **Detailed Shape Prediction**: SLF predicts detailed 3D shapes rather than only 3D bounding boxes, giving finer-grained annotations.
2. **No Training Required**: SLF does not rely on supervised training on a specific dataset, so it avoids overfitting to that dataset's annotation biases and generalizes better.
3. **High-Quality Labels**: On the KITTI dataset, SLF reaches an AP@0.5 IoU of nearly 90%, and detectors trained with the generated pseudo-labels perform nearly as well as detectors trained with ground-truth labels.
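The AP@0.5 IoU figure counts a predicted box as correct only if its 3D intersection-over-union with a ground-truth box is at least 0.5. The sketch below illustrates that matching criterion under a simplifying assumption: boxes are axis-aligned and given as `(xmin, ymin, zmin, xmax, ymax, zmax)`, whereas KITTI boxes are rotated about the vertical axis. The function name and the example boxes are hypothetical.

```python
def iou_3d_axis_aligned(box_a, box_b):
    """3D IoU of two axis-aligned boxes (xmin, ymin, zmin, xmax, ymax, zmax)."""
    intersection = 1.0
    for axis in range(3):
        lo = max(box_a[axis], box_b[axis])
        hi = min(box_a[axis + 3], box_b[axis + 3])
        if hi <= lo:
            return 0.0  # no overlap along this axis
        intersection *= hi - lo

    def volume(box):
        return (box[3] - box[0]) * (box[4] - box[1]) * (box[5] - box[2])

    union = volume(box_a) + volume(box_b) - intersection
    return intersection / union


# A 4 m x 2 m x 2 m pseudo-label shifted 1 m along x from the ground truth.
ground_truth = (0.0, 0.0, 0.0, 4.0, 2.0, 2.0)
pseudo_label = (1.0, 0.0, 0.0, 5.0, 2.0, 2.0)

iou = iou_3d_axis_aligned(ground_truth, pseudo_label)
is_match = iou >= 0.5  # counted as a true positive at the 0.5 threshold
```

Here the overlap volume is 12 and the union is 20, so the IoU is 0.6 and the pseudo-label counts as a match. Average precision is then computed from such matches across all predictions ranked by confidence.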
### Experimental Results
1. **Comparison with Unsupervised Auto-Labelers**: SLF outperforms other unsupervised auto-labelers on the KITTI validation set, especially on moderate and hard samples.
2. **Cross-Dataset Generalization**: On the more challenging nuScenes dataset, SLF outperforms supervised auto-labelers such as MTrans, as measured by mAP and NDS (nuScenes Detection Score).
3. **Detector Performance**: Detectors trained with pseudo-labels generated by SLF outperform those trained with pseudo-labels generated by FGR across multiple metrics.
### Conclusion
This paper proposes SLF, a method that automatically generates 3D object shape labels from 2D prompts, addressing the automation of 3D annotation for autonomous driving. SLF predicts detailed 3D shapes, generalizes across datasets without dataset-specific training, and produces pseudo-labels accurate enough to train detectors that perform nearly as well as those trained on ground-truth annotations.