SIA-OVD: Shape-Invariant Adapter for Bridging the Image-Region Gap in Open-Vocabulary Detection

Zishuo Wang, Wenhao Zhou, Jinglin Xu, Yuxin Peng
DOI: https://doi.org/10.1145/3664647.3680642
2024-10-08
Abstract: Open-vocabulary detection (OVD) aims to detect novel objects without instance-level annotations to achieve open-world object detection at a lower cost. Existing OVD methods mainly rely on the powerful open-vocabulary image-text alignment capability of Vision-Language Pretrained Models (VLM) such as CLIP. However, CLIP is trained on image-text pairs and lacks the perceptual ability for local regions within an image, resulting in the gap between image and region representations. Directly using CLIP for OVD causes inaccurate region classification. We find the image-region gap is primarily caused by the deformation of region feature maps during region of interest (RoI) extraction. To mitigate the inaccurate region classification in OVD, we propose a new Shape-Invariant Adapter named SIA-OVD to bridge the image-region gap in the OVD task. SIA-OVD learns a set of feature adapters for regions with different shapes and designs a new adapter allocation mechanism to select the optimal adapter for each region. The adapted region representations can align better with text representations learned by CLIP. Extensive experiments demonstrate that SIA-OVD effectively improves the classification accuracy for regions by addressing the gap between images and regions caused by shape deformation. SIA-OVD achieves substantial improvements over representative methods on the COCO-OVD benchmark. The code is available at https://github.com/PKU-ICST-MIPL/SIA-OVD_ACMMM2024.
Computer Vision and Pattern Recognition, Multimedia
What problem does this paper attempt to address?
### Problems Addressed by the Paper

This paper addresses the image-region gap in the Open-Vocabulary Detection (OVD) task. Existing OVD methods mainly rely on vision-language pretrained models such as CLIP, which are trained on image-text pairs and lack the ability to perceive local regions. This leaves a gap between image and region representations, so using CLIP directly for OVD results in inaccurate region classification.

### Background and Challenges

1. **Open-Vocabulary Detection (OVD)**:
   - OVD aims to detect novel objects without instance-level annotations, enabling low-cost open-world object detection.
   - Existing OVD methods mainly rely on vision-language pretrained models such as CLIP, which are pretrained on large-scale image-text datasets and have strong open-vocabulary image-text alignment capabilities.

2. **Image-Region Gap**:
   - CLIP is trained on image-text pairs and lacks the ability to perceive local regions within images.
   - Directly using CLIP for OVD results in inaccurate region classification. The authors attribute the image-region gap primarily to the deformation of region feature maps during region of interest (RoI) extraction.

### Solution

To address this issue, the authors propose a new Shape-Invariant Adapter, SIA-OVD, to bridge the image-region gap in the OVD task. The specific contributions are as follows:

1. **Shape-Invariant Adapter (SIA)**:
   - SIA learns a set of feature adapters for regions of different shapes and uses a new adapter assignment mechanism to select the best adapter for each region.
   - The adapted, shape-invariant region representations align better with the text representations learned by CLIP, which improves the accuracy of region classification.

2. **Adapter Assignment Mechanism**:
   - This mechanism adjusts the weights of the different adapters based on the shape of the object, so that each adapter handles regions of similar shapes. This reduces the image-region gap caused by shape variations.

3. **Experimental Validation**:
   - The authors conducted extensive experiments on the COCO-OVD benchmark, showing that SIA-OVD improves the accuracy of region classification, especially for regions with extreme aspect ratios.
   - SIA-OVD outperforms representative methods on the COCO-OVD benchmark in both open-vocabulary detection and region classification.

### Conclusion

By proposing the Shape-Invariant Adapter (SIA), this paper addresses the image-region gap in the OVD task and improves the accuracy of region classification. SIA allows CLIP's open-vocabulary knowledge to be applied directly to the OVD task without fine-tuning the parameters of the CLIP image encoder, which makes detection of novel objects more robust.
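### Illustrative Sketch

The adapter assignment idea described under Solution can be illustrated with a small, self-contained Python sketch. It is not the authors' implementation. It assumes that region shape is summarized by the log aspect ratio of the box, that each adapter is a plain linear map with a hand-picked "center" shape, and that adapter weights come from a softmax over the distance to those centers. The names `CENTERS`, `ADAPTERS`, `TEXT_EMBEDDINGS`, and the temperature value are hypothetical, and the two-dimensional vectors stand in for real CLIP features.

```python
import math

# Hypothetical adapter "centers" in log aspect ratio (width / height):
# one adapter for tall regions, one for square regions, one for wide regions.
CENTERS = [math.log(0.25), 0.0, math.log(4.0)]

# Hypothetical adapters: one small linear map (2x2 matrix) per shape group.
ADAPTERS = [
    [[1.0, 0.0], [0.0, 0.5]],  # tall regions
    [[1.0, 0.0], [0.0, 1.0]],  # square regions (identity)
    [[0.5, 0.0], [0.0, 1.0]],  # wide regions
]

# Toy stand-ins for CLIP text embeddings of category names.
TEXT_EMBEDDINGS = {"cat": [1.0, 0.0], "dog": [0.0, 1.0]}


def log_aspect_ratio(box):
    """box = (x1, y1, x2, y2); returns log(width / height)."""
    x1, y1, x2, y2 = box
    return math.log((x2 - x1) / (y2 - y1))


def adapter_weights(box, centers, temperature=0.5):
    """Softmax over negative distance between the box shape and each center."""
    r = log_aspect_ratio(box)
    scores = [-abs(r - c) / temperature for c in centers]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def apply_adapter(matrix, feature):
    """Matrix-vector product: one linear adapter applied to one feature."""
    return [sum(w * x for w, x in zip(row, feature)) for row in matrix]


def adapt_feature(feature, box, adapters, centers):
    """Blend the outputs of all adapters using the shape-dependent weights."""
    weights = adapter_weights(box, centers)
    blended = [0.0] * len(feature)
    for weight, matrix in zip(weights, adapters):
        adapted = apply_adapter(matrix, feature)
        blended = [b + weight * a for b, a in zip(blended, adapted)]
    return blended


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def classify(feature, text_embeddings):
    """Pick the category whose text embedding is most similar to the feature."""
    return max(text_embeddings, key=lambda name: cosine(feature, text_embeddings[name]))


# Usage: a wide box puts most of its weight on the "wide" adapter.
wide_box = (0, 0, 400, 100)
print(adapter_weights(wide_box, CENTERS))
print(classify(adapt_feature([0.9, 0.2], wide_box, ADAPTERS, CENTERS), TEXT_EMBEDDINGS))
```

The soft weighting is one possible reading of "adjusts the weights of different adapters based on the shape of the object". A hard assignment, where each region uses only the single nearest adapter, would be an equally plausible variant under the same assumptions.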