Abstract: Open-vocabulary detection (OVD) aims to detect novel objects without instance-level annotations to achieve open-world object detection at a lower cost. Existing OVD methods mainly rely on the powerful open-vocabulary image-text alignment capability of Vision-Language Pretrained Models (VLM) such as CLIP. However, CLIP is trained on image-text pairs and lacks the perceptual ability for local regions within an image, resulting in the gap between image and region representations. Directly using CLIP for OVD causes inaccurate region classification. We find the image-region gap is primarily caused by the deformation of region feature maps during region of interest (RoI) extraction. To mitigate the inaccurate region classification in OVD, we propose a new Shape-Invariant Adapter named SIA-OVD to bridge the image-region gap in the OVD task. SIA-OVD learns a set of feature adapters for regions with different shapes and designs a new adapter allocation mechanism to select the optimal adapter for each region. The adapted region representations can align better with text representations learned by CLIP. Extensive experiments demonstrate that SIA-OVD effectively improves the classification accuracy for regions by addressing the gap between images and regions caused by shape deformation. SIA-OVD achieves substantial improvements over representative methods on the COCO-OVD benchmark. The code is available at https://github.com/PKU-ICST-MIPL/SIA-OVD_ACMMM2024.
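
To make the deformation during RoI extraction concrete, here is a minimal arithmetic sketch. It assumes the region is resampled onto a fixed square grid, as RoI pooling operators commonly do; the 7×7 grid size and the function names are illustrative assumptions, not details taken from the paper.

```python
def roi_bin_size(box, output_size=7):
    """Return (bin_w, bin_h): the input extent covered by one output cell
    when a box (x1, y1, x2, y2) is resampled to output_size x output_size.
    output_size=7 is an assumed, commonly used value."""
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    return w / output_size, h / output_size


def stretch_factor(box):
    """Ratio between the longer and shorter side of the box. A value of 1.0
    means the square grid preserves the shape; larger values mean the
    pooled feature map is squeezed more along one axis."""
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    return max(w / h, h / w)


square_box = (0, 0, 140, 140)
wide_box = (0, 0, 280, 70)

print(roi_bin_size(square_box), stretch_factor(square_box))  # (20.0, 20.0) 1.0
print(roi_bin_size(wide_box), stretch_factor(wide_box))      # (40.0, 10.0) 4.0
```

For the wide box, each output cell summarizes a patch four times wider than it is tall, so the pooled feature map no longer resembles the undistorted whole images CLIP was trained on.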

What problem does this paper attempt to address?

### Problems Addressed by the Paper

This paper addresses the image-region gap in the Open-Vocabulary Detection (OVD) task. Existing OVD methods mainly rely on vision-language pre-trained models such as CLIP, which are trained on image-text pairs and lack the ability to perceive local regions. This leaves a gap between image and region representations, so directly using CLIP for OVD results in inaccurate region classification.

### Background and Challenges

1. **Open-Vocabulary Detection (OVD)**:
   - OVD aims to detect novel objects without instance-level annotations, enabling low-cost open-world object detection.
   - Existing OVD methods mainly rely on vision-language pre-trained models (such as CLIP), which are pre-trained on large-scale image-text datasets and have strong open-vocabulary image-text alignment capabilities.

2. **Image-Region Gap**:
   - CLIP is trained on image-text pairs and lacks the ability to perceive local regions within images.
   - Directly using CLIP for OVD results in inaccurate region classification. The authors attribute the image-region gap primarily to the deformation of region feature maps during region of interest (RoI) extraction.

### Solution

To address this issue, the authors propose a Shape-Invariant Adapter, named SIA-OVD, to bridge the image-region gap in the OVD task. The specific contributions are as follows:

1. **Shape-Invariant Adapter (SIA)**:
   - SIA learns a set of feature adapters for regions of different shapes and designs a new adapter assignment mechanism to select the best adapter for each region.
   - The adapted region representations align better with the text representations learned by CLIP, which improves the accuracy of region classification.

2. **Adapter Assignment Mechanism**:
   - This mechanism adjusts the weights of different adapters based on the shape of the object, so that each adapter handles regions of similar shapes. This reduces the image-region gap caused by shape variations.

3. **Experimental Validation**:
   - The authors conducted extensive experiments on the COCO-OVD benchmark, showing that SIA-OVD improves the accuracy of region classification, especially for regions with extreme aspect ratios.
   - SIA-OVD outperforms representative methods on the COCO-OVD benchmark, with substantial improvements in both open-vocabulary detection and region classification.

### Conclusion

By proposing the Shape-Invariant Adapter (SIA), this paper mitigates the image-region gap in the OVD task and improves the accuracy of region classification. SIA allows CLIP's open-vocabulary knowledge to be applied directly to the OVD task without fine-tuning the CLIP image encoder parameters, and it shows stronger robustness and adaptability in detecting novel objects.
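
The adapter assignment idea described above can be sketched roughly as follows. This is not the authors' implementation: the anchor aspect ratios, the softmax weighting over log-aspect-ratio distance, the `temperature` value, and the toy scaling adapters are all hypothetical choices made only to illustrate how shape-dependent weights could mix a set of adapters.

```python
import math

# Hypothetical anchor shapes (log of width/height): tall, square, wide.
ANCHOR_LOG_RATIOS = [math.log(0.25), 0.0, math.log(4.0)]


def log_aspect_ratio(box):
    """Log of width/height for a box given as (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    return math.log((x2 - x1) / (y2 - y1))


def allocation_weights(box, anchor_log_ratios=ANCHOR_LOG_RATIOS, temperature=0.5):
    """Softmax weights over adapters: the closer an anchor shape is to the
    region's shape, the larger the weight of the matching adapter."""
    r = log_aspect_ratio(box)
    scores = [-abs(r - a) / temperature for a in anchor_log_ratios]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def apply_adapters(feature, weights, adapters):
    """Weighted sum of the adapter outputs for one region feature vector."""
    out = [0.0] * len(feature)
    for w, adapter in zip(weights, adapters):
        adapted = adapter(feature)
        for i, value in enumerate(adapted):
            out[i] += w * value
    return out


# Toy stand-ins for learned adapters: each one just rescales the feature.
toy_adapters = [
    lambda f: [0.5 * x for x in f],
    lambda f: [1.0 * x for x in f],
    lambda f: [2.0 * x for x in f],
]

wide_box = (0, 0, 280, 70)
weights = allocation_weights(wide_box)
adapted = apply_adapters([1.0, -1.0], weights, toy_adapters)
```

In this sketch a wide region places most of its weight on the third (wide-anchor) adapter, while a square region favors the second. A soft weighting is used here so that regions between two anchor shapes blend neighboring adapters; a hard argmax selection would be an equally plausible reading of "select the best adapter for each region".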

SIA-OVD: Shape-Invariant Adapter for Bridging the Image-Region Gap in Open-Vocabulary Detection

SLV: Spatial Likelihood Voting for Weakly Supervised Object Detection

Spatial Likelihood Voting with Self-Knowledge Distillation for Weakly Supervised Object Detection.

What Makes Good Open-Vocabulary Detector: A Disassembling Perspective

Bridging the Gap between Object and Image-level Representations for Open-Vocabulary Detection

OVA-DETR: Open Vocabulary Aerial Object Detection Using Image-Text Alignment and Fusion

Aligning Bag of Regions for Open-Vocabulary Object Detection

Open-Vocabulary Object Detection via Neighboring Region Attention Alignment

CORA: Adapting CLIP for Open-Vocabulary Detection with Region Prompting and Anchor Pre-Matching

Simple Image-level Classification Improves Open-vocabulary Object Detection

Adapting Vision-Language Model with Fine-grained Semantics for Open-Vocabulary Segmentation

Unified Embedding Alignment for Open-Vocabulary Video Instance Segmentation

Optimization Efficient Open-World Visual Region Recognition

OpenDAS: Open-Vocabulary Domain Adaptation for 2D and 3D Segmentation

OVIS: Open-Vocabulary Visual Instance Search via Visual-Semantic Aligned Representation Learning

Open-Vocabulary Panoptic Segmentation with Text-to-Image Diffusion Models

Image-to-Image Matching via Foundation Models: A New Perspective for Open-Vocabulary Semantic Segmentation

OV-DINO: Unified Open-Vocabulary Detection with Language-Aware Selective Fusion

DST-Det: Simple Dynamic Self-Training for Open-Vocabulary Object Detection

Learning Object-Language Alignments for Open-Vocabulary Object Detection

Open-Vocabulary Object Detection with an Open Corpus