Abstract: Object detection (OD) in computer vision has made significant progress in recent years, transitioning from closed-set labels to open-vocabulary detection (OVD) based on large-scale vision-language pre-training (VLP). However, current evaluation methods and datasets are limited to testing generalization over object types and referral expressions, which do not provide a systematic, fine-grained, and accurate benchmark of OVD models' abilities. In this paper, we propose a new benchmark named OVDEval, which includes 9 sub-tasks and introduces evaluations on commonsense knowledge, attribute understanding, position understanding, object relation comprehension, and more. The dataset is meticulously created to provide hard negatives that challenge models' true understanding of visual and linguistic input. Additionally, we identify a problem with the popular Average Precision (AP) metric when benchmarking models on these fine-grained label datasets and propose a new metric called Non-Maximum Suppression Average Precision (NMS-AP) to address this issue. Extensive experimental results show that existing top OVD models all fail on the new tasks except for simple object types, demonstrating the value of the proposed dataset in pinpointing the weakness of current OVD models and guiding future research. Furthermore, the proposed NMS-AP metric is verified by experiments to provide a much more truthful evaluation of OVD models, whereas traditional AP metrics yield deceptive results. Data is available at https://github.com/om-ai-lab/OVDEval

What problem does this paper attempt to address?

This paper addresses the evaluation of open-vocabulary detection (OVD) models. Current evaluation methods and datasets only test generalization over object types and referring expressions, so they do not provide a systematic, fine-grained, and accurate benchmark of OVD models' capabilities. In particular, they do not systematically probe commonsense knowledge, attribute understanding, position understanding, or object relation comprehension, and they lack hard negatives that test whether a model truly understands and distinguishes the visual and linguistic input. To address this, the paper proposes a new benchmark dataset, OVDEval, which includes 9 sub-tasks covering 6 linguistic aspects: objects, proper nouns, attributes, positions, relationships, and negations. Each sub-dataset is designed to provide hard negatives that challenge the model's true understanding. The paper also proposes a new evaluation metric, Non-Maximum Suppression Average Precision (NMS-AP), to address the "inflated AP problem" of the traditional Average Precision (AP) metric on fine-grained label datasets: a model can obtain a high AP score by predicting multiple bounding boxes even if it does not truly understand the described content. Experimental results show that existing top OVD models perform poorly on all of the new tasks except simple object types, which demonstrates the value of OVDEval in pinpointing the weaknesses of current OVD models and guiding future research.
The proposed NMS-AP metric is also verified experimentally to give a more truthful evaluation of OVD models, whereas the traditional AP metric can produce misleading results.
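The inflated AP problem can be illustrated with a small sketch. It is not the paper's implementation: it assumes that NMS-AP amounts to applying class-agnostic non-maximum suppression to a model's predictions before they are scored, and the IoU threshold of 0.5, the example boxes, the labels, and all function names are hypothetical.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)
Pred = Tuple[Box, str, float]            # (box, label, confidence score)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def class_agnostic_nms(preds: List[Pred], iou_thr: float = 0.5) -> List[Pred]:
    """Among overlapping boxes keep only the top-scoring one, ignoring labels."""
    kept: List[Pred] = []
    for pred in sorted(preds, key=lambda p: p[2], reverse=True):
        if all(iou(pred[0], k[0]) <= iou_thr for k in kept):
            kept.append(pred)
    return kept


# Hypothetical image containing one red car. The model emits the same box
# under both candidate labels instead of committing to one of them.
preds: List[Pred] = [
    ((10, 10, 110, 60), "blue car", 0.62),
    ((10, 10, 110, 60), "red car", 0.58),
]
ground_truth_label = "red car"

# Per-label scoring: the "red car" box exists, so the label gets a hit
# even though the model ranked the wrong attribute higher.
per_label_hit = any(label == ground_truth_label for _, label, _ in preds)

# Scoring after class-agnostic NMS (assumed NMS-AP behaviour): only the
# top-scoring box survives, and it carries the wrong label.
after_nms = class_agnostic_nms(preds)
nms_hit = any(label == ground_truth_label for _, label, _ in after_nms)

print(per_label_hit, nms_hit)  # True False
```

Under these assumptions, the duplicated box earns credit when each label is scored independently, but not once overlapping predictions have to compete across labels.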

How to Evaluate the Generalization of Detection? A Benchmark for Comprehensive Open-Vocabulary Detection
