ZOPP: A Framework of Zero-shot Offboard Panoptic Perception for Autonomous Driving

Tao Ma, Hongbin Zhou, Qiusheng Huang, Xuemeng Yang, Jianfei Guo, Bo Zhang, Min Dou, Yu Qiao, Botian Shi, Hongsheng Li
2024-11-08
Abstract: Offboard perception aims to automatically generate high-quality 3D labels for autonomous driving (AD) scenes. Existing offboard methods focus on 3D object detection with closed-set taxonomy and fail to match human-level recognition capability on the rapidly evolving perception tasks. Due to heavy reliance on human labels and the prevalence of data imbalance and sparsity, a unified framework for offboard auto-labeling of the various elements in AD scenes that meets the distinct needs of perception tasks has not been fully explored. In this paper, we propose a novel multi-modal Zero-shot Offboard Panoptic Perception (ZOPP) framework for autonomous driving scenes. ZOPP integrates the powerful zero-shot recognition capabilities of vision foundation models and 3D representations derived from point clouds. To the best of our knowledge, ZOPP represents a pioneering effort in the domain of multi-modal panoptic perception and auto labeling for autonomous driving scenes. We conduct comprehensive empirical studies and evaluations on the Waymo Open Dataset to validate the proposed ZOPP on various perception tasks. To further explore the usability and extensibility of our proposed ZOPP, we also conduct experiments in downstream applications. The results further demonstrate the great potential of our ZOPP for real-world scenarios.
Computer Vision and Pattern Recognition, Robotics
What problem does this paper attempt to address?
The paper addresses a limitation of offboard perception for autonomous driving: existing methods rely on large amounts of high-quality manually annotated data and focus mainly on closed-set 3D object detection, so they cannot keep up with rapidly changing perception tasks. The authors therefore propose ZOPP, a novel multi-modal Zero-shot Offboard Panoptic Perception framework that aims to generate high-quality 3D labels without manual annotation and to support multiple perception tasks.

### Specific problems include:

1. **Dependence on manual annotation**: Existing methods require a large amount of high-quality manually annotated data, which is both time-consuming and expensive to produce.
2. **Limitations of closed-set classification**: Existing offboard perception methods mainly perform 3D object detection for predefined categories and cannot handle new categories in an open-set setting.
3. **Data imbalance and sparsity**: Data for small or distant objects is very sparse, so automatic annotation models perform poorly in these cases.
4. **Insufficient cross-domain generalization ability**: Because different types of 3D sensors introduce domain differences, existing models generalize poorly across them.

### Goals of ZOPP:

- **Zero-shot recognition**: Combine the zero-shot recognition ability of vision foundation models with 3D representations derived from point clouds to recognize unseen categories effectively.
- **Multi-modal input**: Integrate multi-view images and point cloud data to generate robust semantic and instance segmentation results.
- **High-precision 3D bounding boxes**: Complete sparse point clouds to generate accurate 3D bounding boxes, especially for dynamic objects.
- **4D occupancy flow prediction**: Use neural rendering to reconstruct 3D scenes and decode 4D occupancy flow, providing more detailed geometric and semantic information.

Through these improvements, ZOPP aims to provide a unified and efficient offboard auto-labeling framework for autonomous driving scenes, one that copes better with the diversity and complexity of perception tasks.

### Key formulas and concepts:

- **Point cloud alignment**:
  \[ p_C = R \cdot p_L + t \]
  where \( p_C \) is the point in the camera coordinate system, \( p_L \) is the point in the LiDAR coordinate system, and \( R \) and \( t \) are the rotation matrix and translation vector, respectively.
- **Depth filtering threshold**:
  \[ \frac{\max(p) - \min(p)}{\min(p)} > \theta \]
  When this condition holds, background points need to be filtered.
- **L-shape fitting**:
  \[ \text{Initial Box} = \text{L-Shape fitting}(p_{\text{object}}) \]
  This step generates the initial 3D bounding box.

These techniques work together to enable ZOPP to generate high-quality 3D labels and support multiple perception tasks without manual annotation.
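
The point cloud alignment and the depth filtering check listed above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names `lidar_to_camera` and `needs_background_filtering`, the toy extrinsics, and the threshold value 0.2 are all hypothetical, and the sketch assumes that \( p \) denotes the depths of the points belonging to one object, which the summary does not state explicitly.

```python
import numpy as np


def lidar_to_camera(points_lidar, R, t):
    """Apply p_C = R @ p_L + t to an (N, 3) array of LiDAR points."""
    points_lidar = np.asarray(points_lidar, dtype=float)
    R = np.asarray(R, dtype=float)
    t = np.asarray(t, dtype=float)
    # Row-vector form of the same transform: (R @ p.T).T == p @ R.T
    return points_lidar @ R.T + t


def needs_background_filtering(depths, theta):
    """Return True when (max(p) - min(p)) / min(p) exceeds theta.

    Assumption: `depths` holds the positive depths of the points
    associated with a single object.
    """
    depths = np.asarray(depths, dtype=float)
    d_min, d_max = depths.min(), depths.max()
    if d_min <= 0:
        raise ValueError("depths must be strictly positive")
    return bool((d_max - d_min) / d_min > theta)


# Toy extrinsics (hypothetical): 90-degree rotation about z plus a shift.
R = np.array([[0.0, -1.0, 0.0],
              [1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0]])
t = np.array([0.5, 0.0, -1.0])

points_lidar = np.array([[1.0, 0.0, 0.0],
                         [0.0, 2.0, 0.0]])
points_camera = lidar_to_camera(points_lidar, R, t)

# Hypothetical threshold; the summary does not give a value for theta.
theta = 0.2
tight_cluster = [10.0, 10.4, 10.9]   # small relative depth spread
mixed_cluster = [10.0, 10.5, 25.0]   # likely contains background points
print(points_camera)
print(needs_background_filtering(tight_cluster, theta))
print(needs_background_filtering(mixed_cluster, theta))
```

The check only flags an object whose depth spread is large relative to its nearest point. How the flagged background points are then removed is not described in the summary, so the sketch stops at the decision.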