LOVD: Large-and-Open Vocabulary Object Detection

Shiyu Tang, Zhaofan Luo, Yifan Wang, Lijun Wang, Huchuan Lu, Weibo Su, Libo Liu
DOI: https://doi.org/10.1145/3664647.3680925
2024-01-01
Abstract: Existing open-vocabulary object detectors require an accurate and compact vocabulary to be pre-defined at inference time. Their performance degrades substantially in real scenarios, where the underlying vocabulary may be indeterminate and often exponentially large. To understand this phenomenon more comprehensively, we propose a new setting called Large-and-Open Vocabulary object Detection, which simulates real scenarios by testing detectors with large vocabularies containing thousands of unseen categories. The vast number of unseen categories inevitably increases the number of category distractors, severely impeding the recognition process and leading to unsatisfactory detection results. To address this challenge, we propose a Large and Open Vocabulary Detector (LOVD) with two core components: the Image-to-Region Filtering (IRF) module and the Cross-View Verification (CV2) scheme. To mitigate the category distractors in a given large vocabulary, IRF performs image-level recognition to build a compact vocabulary relevant to the image scene out of the large input vocabulary, followed by region-level classification over the compact vocabulary. CV2 further enhances IRF by conducting image-to-region filtering in both global and local views and produces the final detection categories through a two-branch voting mechanism. Compared to prior work, LOVD is more scalable and robust to large input vocabularies, and can be seamlessly integrated with predominant detection methods to improve their open-vocabulary performance. The code can be found at https://github.com/Altria-luo/LOVD.
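The filter-then-classify idea described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation (see the repository above for that): the function names, the dictionary score format, the top-k selection rule, and in particular the voting rule are all assumptions made for illustration, since the abstract does not specify them.

```python
from typing import Dict, List, Tuple

# Hypothetical format: category name -> similarity score from some
# image-text model. The abstract does not specify the scoring model.
Scores = Dict[str, float]
Prediction = Tuple[str, float]


def compact_vocabulary(view_scores: Scores, k: int) -> List[str]:
    """Image-level step: keep the k categories that score highest for a view.

    Top-k selection is an assumption; the paper may select differently.
    """
    ranked = sorted(view_scores, key=view_scores.get, reverse=True)
    return ranked[:k]


def classify_region(region_scores: Scores, vocab: List[str]) -> Prediction:
    """Region-level step: best category, restricted to the compact vocabulary."""
    best = max(vocab, key=lambda c: region_scores.get(c, float("-inf")))
    return best, region_scores.get(best, float("-inf"))


def image_to_region_filter(view_scores: Scores, region_scores: Scores, k: int) -> Prediction:
    """Filter the large vocabulary at view level, then classify the region."""
    return classify_region(region_scores, compact_vocabulary(view_scores, k))


def two_branch_vote(global_pred: Prediction, local_pred: Prediction) -> Prediction:
    """Assumed voting rule: if both branches agree, average their scores;
    otherwise keep the higher-scoring prediction."""
    (g_cat, g_score), (l_cat, l_score) = global_pred, local_pred
    if g_cat == l_cat:
        return g_cat, (g_score + l_score) / 2
    return global_pred if g_score >= l_score else local_pred


# Toy scores: "wolf" is a distractor that wins at region level when the
# full vocabulary is used, but is filtered out by the view-level step.
region = {"dog": 0.55, "sofa": 0.50, "wolf": 0.60, "submarine": 0.10}
global_view = {"dog": 0.70, "sofa": 0.80, "wolf": 0.20, "submarine": 0.05}
local_view = {"dog": 0.90, "sofa": 0.40, "wolf": 0.30, "submarine": 0.00}

unfiltered = classify_region(region, list(region))
global_pred = image_to_region_filter(global_view, region, k=2)
local_pred = image_to_region_filter(local_view, region, k=2)
final = two_branch_vote(global_pred, local_pred)
```

In this toy example the same region scores are reused for both branches to keep the sketch short; in a cross-view setup the local branch would presumably score the region from a cropped view.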