OpenAD: Open-World Autonomous Driving Benchmark for 3D Object Detection

Zhongyu Xia, Jishuo Li, Zhiwei Lin, Xinhao Wang, Yongtao Wang, Ming-Hsuan Yang
2024-11-26
Abstract: Open-world autonomous driving encompasses domain generalization and open-vocabulary. Domain generalization refers to the capabilities of autonomous driving systems across different scenarios and sensor parameter configurations. Open vocabulary pertains to the ability to recognize various semantic categories not encountered during training. In this paper, we introduce OpenAD, the first real-world open-world autonomous driving benchmark for 3D object detection. OpenAD is built on a corner case discovery and annotation pipeline integrating with a multimodal large language model (MLLM). The proposed pipeline annotates corner case objects in a unified format for five autonomous driving perception datasets with 2000 scenarios. In addition, we devise evaluation methodologies and evaluate various 2D and 3D open-world and specialized models. Moreover, we propose a vision-centric 3D open-world object detection baseline and further introduce an ensemble method by fusing general and specialized models to address the issue of lower precision in existing open-world methods for the OpenAD benchmark. Annotations, toolkit code, and all evaluation codes will be released.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
This paper addresses the challenges of 3D object detection in open-world autonomous driving. It focuses on two key issues, **domain generalization** and **open vocabulary**:

1. **Domain generalization**: how well an autonomous driving system performs across different scenarios and sensor parameter configurations. Existing models perform poorly on unseen scenarios, which limits their reliability and robustness in practical applications.
2. **Open vocabulary**: the model's ability to recognize semantic categories that it did not encounter during training. This is crucial for subsequent reasoning and planning, such as determining whether an object is collidable, whether it will move suddenly, or whether it indicates that certain areas are impassable.

To address these issues, the authors propose **OpenAD**, an open-world autonomous driving benchmark for 3D object detection. Its main features include:

- **Richly annotated data**: 2,000 scenes drawn from five autonomous driving perception datasets, with thousands of corner-case objects annotated.
- **Multimodal large language model (MLLM)-integrated annotation pipeline**: used to automatically identify and annotate corner-case objects.
- **Evaluation methods**: new evaluation metrics designed to assess both the domain generalization ability and the open-vocabulary ability of a model.

Through these efforts, OpenAD aims to fill the gaps in existing 3D perception datasets and to provide a more comprehensive and challenging benchmark for open-world autonomous driving.
### Formula summary

The formulas in the paper are mainly used to compute the evaluation metrics:

- **Average Precision (AP)** and **Average Recall (AR)**:

\[ \text{AP}=\frac{\sum_{i = 1}^{N}\text{TP}_i}{\sum_{i = 1}^{N}(\text{TP}_i+\text{FP}_i)} \]

\[ \text{AR}=\frac{\sum_{i = 1}^{N}\text{TP}_i}{\sum_{i = 1}^{N}(\text{TP}_i+\text{FN}_i)} \]

where $\text{TP}$ denotes true positives, $\text{FP}$ false positives, and $\text{FN}$ false negatives. As written, these are precision and recall ratios pooled over the $N$ terms of the sums.

- **Position threshold and semantic similarity threshold**:
  - For 2D object detection, the Intersection over Union (IoU) is used as the position score, and the cosine similarity is used as the semantic score.
  - For 3D object detection, the center distance is used as the position score, and the cosine similarity is again used as the semantic score.

Combining a position criterion with a semantic criterion lets the metrics reflect both localization quality and category recognition, including on unseen categories and scenarios.
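The 3D criteria and the pooled ratios above can be illustrated with a minimal Python sketch. It is not the OpenAD toolkit: the greedy one-to-one matching, the dictionary fields (`center`, `embedding`, `score`), the function names, and the threshold values are all assumptions made for illustration, since the summary does not specify the matching procedure or the thresholds.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors (semantic score)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def match_3d(preds, gts, dist_thresh, sim_thresh):
    """Count TP, FP, FN for one scene.

    Assumption: greedy matching in descending confidence order; a prediction
    matches an unused ground-truth box only if the center distance is within
    dist_thresh AND the cosine similarity is at least sim_thresh.
    """
    used = set()
    tp = 0
    for pred in sorted(preds, key=lambda p: p["score"], reverse=True):
        best_j, best_d = None, None
        for j, gt in enumerate(gts):
            if j in used:
                continue
            d = math.dist(pred["center"], gt["center"])  # position score
            s = cosine_similarity(pred["embedding"], gt["embedding"])
            if d <= dist_thresh and s >= sim_thresh:
                if best_d is None or d < best_d:
                    best_j, best_d = j, d
        if best_j is not None:
            used.add(best_j)
            tp += 1
    return tp, len(preds) - tp, len(gts) - tp


def pooled_ratios(counts):
    """Pool (TP, FP, FN) triples and return the two ratios from the formulas."""
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical scene: one prediction is close and semantically similar,
# the other is too far from any ground-truth box.
preds = [
    {"center": (0.0, 0.0, 0.0), "embedding": (1.0, 0.0), "score": 0.9},
    {"center": (10.0, 0.0, 0.0), "embedding": (0.0, 1.0), "score": 0.8},
]
gts = [
    {"center": (0.5, 0.0, 0.0), "embedding": (1.0, 0.0)},
    {"center": (20.0, 0.0, 0.0), "embedding": (0.0, 1.0)},
]
scene_counts = match_3d(preds, gts, dist_thresh=2.0, sim_thresh=0.5)
print(scene_counts, pooled_ratios([scene_counts]))
```

For the 2D case, the same structure would apply with IoU replacing the center distance and the comparison flipped to "at least the threshold", since a higher IoU means a better position match.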