How to Evaluate the Generalization of Detection? A Benchmark for Comprehensive Open-Vocabulary Detection

Yiyang Yao, Peng Liu, Tiancheng Zhao, Qianqian Zhang, Jiajia Liao, Chunxin Fang, Kyusong Lee, Qing Wang
2023-12-18
Abstract: Object detection (OD) in computer vision has made significant progress in recent years, transitioning from closed-set labels to open-vocabulary detection (OVD) based on large-scale vision-language pre-training (VLP). However, current evaluation methods and datasets are limited to testing generalization over object types and referral expressions, which do not provide a systematic, fine-grained, and accurate benchmark of OVD models' abilities. In this paper, we propose a new benchmark named OVDEval, which includes 9 sub-tasks and introduces evaluations on commonsense knowledge, attribute understanding, position understanding, object relation comprehension, and more. The dataset is meticulously created to provide hard negatives that challenge models' true understanding of visual and linguistic input. Additionally, we identify a problem with the popular Average Precision (AP) metric when benchmarking models on these fine-grained label datasets and propose a new metric called Non-Maximum Suppression Average Precision (NMS-AP) to address this issue. Extensive experimental results show that existing top OVD models all fail on the new tasks except for simple object types, demonstrating the value of the proposed dataset in pinpointing the weakness of current OVD models and guiding future research. Furthermore, the proposed NMS-AP metric is verified by experiments to provide a much more truthful evaluation of OVD models, whereas traditional AP metrics yield deceptive results. Data is available at https://github.com/om-ai-lab/OVDEval
Computer Vision and Pattern Recognition, Computation and Language
What problem does this paper attempt to address?
This paper addresses how open-vocabulary detection (OVD) models are evaluated. Current evaluation methods and datasets test generalization only over object types and referring expressions, so they do not provide a systematic, fine-grained, and accurate benchmark of OVD models' capabilities. In particular, they do not systematically probe commonsense knowledge, attribute understanding, position understanding, or object-relation comprehension, and they lack hard negative samples that test whether a model truly understands and can discriminate between visual and linguistic inputs.

To address this, the paper proposes a new benchmark dataset, OVDEval, which includes 9 sub-tasks covering 6 linguistic aspects: objects, proper nouns, attributes, positions, relationships, and negations. The sub-datasets are designed to provide hard negatives that challenge a model's true understanding.

The paper also proposes a new evaluation metric, Non-Maximum Suppression Average Precision (NMS-AP), to address the "inflated AP problem" of the traditional Average Precision (AP) metric on fine-grained label datasets: a model can obtain a high AP score by predicting multiple bounding boxes even if it does not truly understand the described content. NMS-AP gives a more truthful evaluation of OVD models.

Experimental results show that existing top OVD models perform poorly on the new tasks: performance drops significantly on every task except simple object-type detection. This highlights the value of OVDEval in revealing the weaknesses of current OVD models and guiding future research. The experiments also verify that NMS-AP evaluates OVD models more truthfully, whereas the traditional AP metric can produce misleading results.
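
The inflated AP problem can be illustrated with a toy example. The sketch below is not the paper's implementation. It assumes that NMS-AP amounts to applying class-agnostic non-maximum suppression to the predictions before computing ordinary AP, which is one reading of the metric's name and of the problem described above. The helper names (`iou`, `class_agnostic_nms`, `average_precision`), the 0.5 IoU threshold, the simple non-interpolated AP variant, and the "red car" / "blue car" labels are all hypothetical choices made for illustration.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)
Det = Tuple[Box, str, float]             # (box, label, score)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def class_agnostic_nms(dets: List[Det], iou_thr: float = 0.5) -> List[Det]:
    """Keep the highest-scoring box among overlapping boxes, ignoring labels."""
    kept: List[Det] = []
    for det in sorted(dets, key=lambda d: d[2], reverse=True):
        if all(iou(det[0], k[0]) <= iou_thr for k in kept):
            kept.append(det)
    return kept


def average_precision(dets: List[Det], gt_boxes: List[Box], label: str,
                      iou_thr: float = 0.5) -> float:
    """Non-interpolated AP for one label on one image (illustrative variant)."""
    if not gt_boxes:
        return 0.0
    ranked = sorted((d for d in dets if d[1] == label),
                    key=lambda d: d[2], reverse=True)
    matched = set()
    true_pos = 0
    ap = 0.0
    for rank, (box, _, _) in enumerate(ranked, start=1):
        # Greedily match the prediction to an unmatched ground-truth box.
        hit = next((i for i, g in enumerate(gt_boxes)
                    if i not in matched and iou(box, g) >= iou_thr), None)
        if hit is not None:
            matched.add(hit)
            true_pos += 1
            ap += true_pos / rank  # precision at this recall step
    return ap / len(gt_boxes)


# Hypothetical image: one red car. The model cannot tell colors apart, so it
# emits the same box under both fine-grained labels.
car_box: Box = (10.0, 10.0, 110.0, 60.0)
gt_red_car = [car_box]
predictions: List[Det] = [
    (car_box, "blue car", 0.9),  # wrong label, higher score
    (car_box, "red car", 0.8),   # correct label, lower score
]

plain_ap = average_precision(predictions, gt_red_car, "red car")
nms_ap = average_precision(class_agnostic_nms(predictions), gt_red_car, "red car")
print(plain_ap, nms_ap)
```

In this toy case, plain AP for "red car" is perfect because the wrong "blue car" prediction is scored under a different label and never penalizes the correct one. After class-agnostic suppression, only the model's single most confident label per object survives, so the wrong guess is exposed.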