Abstract: Open-Vocabulary Object Detection (OVOD) aims to detect novel objects beyond a given set of base categories on which the detection model is trained. Recent OVOD methods focus on adapting the image-level pre-trained vision-language models (VLMs), such as CLIP, to a region-level object detection task via, e.g., region-level knowledge distillation, regional prompt learning, or region-text pre-training, to expand the detection vocabulary. These methods have demonstrated remarkable performance in recognizing regional visual concepts, but they are weak in exploiting the VLMs' powerful global scene understanding ability learned from the billion-scale image-level text descriptions. This limits their capability in detecting hard objects of small, blurred, or occluded appearance from novel/base categories, whose detection heavily relies on contextual information. To address this, we propose a novel approach, namely Simple Image-level Classification for Context-Aware Detection Scoring (SIC-CADS), to leverage the superior global knowledge yielded from CLIP for complementing the current OVOD models from a global perspective. The core of SIC-CADS is a multi-modal multi-label recognition (MLR) module that learns the object co-occurrence-based contextual information from CLIP to recognize all possible object categories in the scene. These image-level MLR scores can then be utilized to refine the instance-level detection scores of the current OVOD models in detecting those hard objects. This is verified by extensive empirical results on two popular benchmarks, OV-LVIS and OV-COCO, which show that SIC-CADS achieves significant and consistent improvement when combined with different types of OVOD models. Further, SIC-CADS also improves the cross-dataset generalization ability on Objects365 and OpenImages. The code is available at https://github.com/mala-lab/SIC-CADS.

What problem does this paper attempt to address?

### Problems Addressed by the Paper

This paper addresses a key issue in Open-Vocabulary Object Detection (OVOD): how to better leverage the global scene understanding capabilities of Vision-Language Models (VLMs) to improve the detection of small, blurry, or occluded objects.

#### Specific Issues:

1. **Limitations of Existing Methods**: Current OVOD methods mainly focus on recognizing region-level visual concepts, and they do not fully exploit the global scene understanding that VLMs learn from large-scale image-text data.
2. **Difficulty in Detecting Small Objects**: Existing OVOD methods perform poorly on small, blurry, or occluded objects because detecting them relies heavily on contextual information that region-level methods lack.
3. **Insufficient Cross-Dataset Generalization Ability**: Existing methods generalize poorly across datasets, especially when detecting new categories.

#### Solution:

- A new method called Simple Image-level Classification for Context-Aware Detection Scoring (SIC-CADS) is proposed, which leverages the global knowledge of CLIP through a multi-modal Multi-Label Recognition (MLR) module.
- The MLR module learns image-level multi-modal knowledge, enabling it to recognize all possible object categories present in an image. Its scores are then used to refine the instance-level detection scores of existing OVOD models.
- Experimental results on OV-LVIS and OV-COCO show that SIC-CADS gives significant and consistent improvements when combined with different types of OVOD models, and it also improves cross-dataset generalization on Objects365 and OpenImages.
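The refinement step described above, where image-level MLR scores adjust per-box detection scores, can be illustrated with a small sketch. The abstract does not state the fusion rule, so the weighted geometric mean used here is an assumption chosen for illustration, and the names `refine_scores` and `gamma` as well as all score values are hypothetical rather than taken from the SIC-CADS code.

```python
from typing import Dict, List


def refine_scores(
    det_scores: List[Dict[str, float]],
    mlr_scores: Dict[str, float],
    gamma: float = 0.5,
) -> List[Dict[str, float]]:
    """Fuse per-box detection scores with image-level MLR scores.

    ASSUMPTION: fusion is a weighted geometric mean,
        refined = det ** (1 - gamma) * mlr ** gamma,
    which is one plausible rule, not one confirmed by the abstract.
    """
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must lie in [0, 1]")
    refined = []
    for box_scores in det_scores:
        refined.append({
            # Categories missing from the MLR output are treated as absent (score 0).
            cat: (score ** (1.0 - gamma)) * (mlr_scores.get(cat, 0.0) ** gamma)
            for cat, score in box_scores.items()
        })
    return refined


# Hypothetical example: one small, blurry box in a beach scene.
# The detector alone slightly prefers "banana" over "surfboard".
det_scores = [{"surfboard": 0.30, "banana": 0.32}]
# The image-level classifier, seeing the whole scene, strongly favors "surfboard".
mlr_scores = {"surfboard": 0.90, "banana": 0.05}

refined = refine_scores(det_scores, mlr_scores, gamma=0.5)
best_before = max(det_scores[0], key=det_scores[0].get)
best_after = max(refined[0], key=refined[0].get)
print(best_before, "->", best_after)
```

In this toy case the global context flips the top prediction for the ambiguous box. Setting `gamma` to 0 leaves the detector's scores unchanged, so the parameter controls how much weight the image-level evidence receives.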

Simple Image-level Classification Improves Open-vocabulary Object Detection

Spatial Likelihood Voting with Self-Knowledge Distillation for Weakly Supervised Object Detection

DST-Det: Simple Dynamic Self-Training for Open-Vocabulary Object Detection

What Makes Good Open-Vocabulary Detector: A Disassembling Perspective

Multi-Modal Classifiers for Open-Vocabulary Object Detection

Open-Vocabulary 3D Detection via Image-level Class and Debiased Cross-modal Contrastive Learning

OV-DAR: Open-Vocabulary Object Detection and Attributes Recognition

Open-Vocabulary Object Detection using Pseudo Caption Labels

Open-Vocabulary Object Detection with Meta Prompt Representation and Instance Contrastive Optimization

Boosting Open-Vocabulary Object Detection by Handling Background Samples

Bridging the Gap between Object and Image-level Representations for Open-Vocabulary Detection

Open-Vocabulary Camouflaged Object Segmentation

Learning Object-Language Alignments for Open-Vocabulary Object Detection

Open-Vocabulary Object Detection with an Open Corpus

LP-OVOD: Open-Vocabulary Object Detection by Linear Probing

Category-Adaptive Cross-Modal Semantic Refinement and Transfer for Open-Vocabulary Multi-Label Recognition

TagCLIP: Improving Discrimination Ability of Open-Vocabulary Semantic Segmentation

LOVD: Large-and-Open Vocabulary Object Detection

CASA: Class-Agnostic Shared Attributes in Vision-Language Models for Efficient Incremental Object Detection

Sampling Bag of Views for Open-Vocabulary Object Detection