Abstract: Despite weakly supervised object detection (WSOD) being a promising step toward evading strong instance-level annotations, its capability is confined to closed-set categories within a single training dataset. In this paper, we propose a novel weakly supervised open-vocabulary object detection framework, namely WSOVOD, to extend traditional WSOD to detect novel concepts and utilize diverse datasets with only image-level annotations. To achieve this, we explore three vital strategies, including dataset-level feature adaptation, image-level salient object localization, and region-level vision-language alignment. First, we perform data-aware feature extraction to produce an input-conditional coefficient, which is leveraged into dataset attribute prototypes to identify dataset bias and help achieve cross-dataset generalization. Second, a customized location-oriented weakly supervised region proposal network is proposed to utilize high-level semantic layouts from the category-agnostic segment anything model to distinguish object boundaries. Lastly, we introduce a proposal-concept synchronized multiple-instance network, i.e., object mining and refinement with visual-semantic alignment, to discover objects matched to the text embeddings of concepts. Extensive experiments on Pascal VOC and MS COCO demonstrate that the proposed WSOVOD achieves new state-of-the-art compared with previous WSOD methods in both closed-set object localization and detection tasks. Meanwhile, WSOVOD enables cross-dataset and open-vocabulary learning to achieve on-par or even better performance than well-established fully-supervised open-vocabulary object detection (FSOVOD).

What problem does this paper attempt to address?

### Problems Addressed by the Paper

The paper aims to address several key issues in Weakly Supervised Open-Vocabulary Object Detection (WSOVOD):

1. **Limitations of Traditional WSOD**:
   - Existing weakly supervised object detection (WSOD) methods handle only the closed set of categories in their training dataset; they cannot detect novel concepts or learn from diverse datasets.
   - These methods typically rely on the small category set of a specific dataset, such as the 20 categories in Pascal VOC or the 80 categories in MS COCO.
2. **Data Bias Issue**:
   - Different datasets have different data distributions, so a model trained on one dataset tends to perform poorly on another.
   - For example, ILSVRC is an object-centric dataset with a balanced category distribution, while LVIS contains many complex scenes and has a long-tailed category distribution.
3. **Object Proposal Generation Issue**:
   - Existing WSOD methods rely on traditional object proposal generators that use only low-level features, limiting the model's ability to learn from different semantic levels.
   - Some methods try to improve proposal generation with pseudo-label supervision, but the noisy training signal makes it difficult to generate high-quality proposals.
4. **Vision-Language Alignment Issue**:
   - Aligning visual and language representations is challenging under weak supervision.
   - Existing open-vocabulary research usually requires full supervision, needing both classification embeddings and box knowledge for vision-language alignment.

To address these issues, the authors propose a novel weakly supervised open-vocabulary object detection framework, WSOVOD, built on three main strategies:

1. **Data-Aware Feature Extraction**:
   - Extract data-aware features conditioned on the input and combine them with dataset attribute prototypes to identify dataset bias in proposal features drawn from different distributions.
   - Learn global image features through an additional branch, generating channel-level global vectors that serve as coefficients to recalibrate the final proposal features, which improves generalization across scenes and categories.
2. **Location-Oriented Region Proposal Network**:
   - Use the knowledge of the Segment Anything Model (SAM), a high-precision image segmentation model, to design a location-oriented weakly supervised region proposal network (LOWSRPN) that identifies potential object boundaries.
   - Supplement the proposal set with additional proposals generated by SAM, and use its high-level semantic layouts to distinguish object boundaries.
3. **Proposal-Concept Synchronized Multi-Instance Network**:
   - Introduce a proposal-concept synchronized multi-instance network that discovers potential objects through image-level classification embeddings and gradually aligns visual and language representations.
   - Use a pre-trained text encoder to obtain text embeddings of the target vocabulary and treat them as category prototypes in multi-instance learning, further aligning object and concept representations.

Through these strategies, WSOVOD achieves new state-of-the-art results among WSOD methods on the Pascal VOC and MS COCO datasets. On unseen novel categories, it performs on par with, or even better than, fully supervised open-vocabulary object detection methods.
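The third strategy, treating text embeddings as category prototypes in multi-instance learning, can be illustrated with a small sketch. This is not the authors' implementation: it assumes a WSDDN-style two-stream aggregation (a softmax over classes multiplied by a softmax over proposals), cosine similarity as the proposal-to-concept score, and a hypothetical `temperature` parameter. The function names and the toy vectors, which stand in for the outputs of a visual backbone and a pre-trained text encoder, are illustrative only.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)

def mil_image_scores(proposal_feats, text_embeds, det_logits, temperature=0.1):
    """Aggregate proposal scores into image-level class scores.

    proposal_feats: R proposal feature vectors (length D each).
    text_embeds:    C text embeddings (length D), used as class prototypes.
    det_logits:     R x C logits of an assumed detection stream.
    """
    num_props, num_classes = len(proposal_feats), len(text_embeds)
    # Classification stream: per proposal, softmax over classes of the
    # similarity between the proposal feature and each text prototype.
    cls = [softmax([cosine(f, t) / temperature for t in text_embeds])
           for f in proposal_feats]
    # Detection stream: per class, softmax over proposals.
    det = [softmax([det_logits[r][c] for r in range(num_props)])
           for c in range(num_classes)]
    # Proposal score is the product of both streams; summing over
    # proposals yields an image-level score in [0, 1] for each class.
    proposal_scores = [[cls[r][c] * det[c][r] for c in range(num_classes)]
                       for r in range(num_props)]
    image_scores = [sum(row[c] for row in proposal_scores)
                    for c in range(num_classes)]
    return proposal_scores, image_scores

def image_level_bce(image_scores, labels, eps=1e-7):
    """Binary cross-entropy against image-level labels only (no boxes)."""
    loss = 0.0
    for score, label in zip(image_scores, labels):
        score = min(max(score, eps), 1.0 - eps)
        loss -= label * math.log(score) + (1 - label) * math.log(1.0 - score)
    return loss / len(labels)

# Toy example: two concepts, three proposals that all resemble concept 0.
text_embeds = [[1.0, 0.0], [0.0, 1.0]]
proposal_feats = [[1.0, 0.1], [0.9, 0.0], [0.2, 0.1]]
det_logits = [[2.0, 0.0], [1.0, 0.0], [0.0, 0.0]]
_, image_scores = mil_image_scores(proposal_feats, text_embeds, det_logits)
loss = image_level_bce(image_scores, labels=[1, 0])
```

Because the class prototypes are text embeddings rather than learned classifier weights, the vocabulary can in principle be swapped at inference time by passing a different `text_embeds` list, which is the property that open-vocabulary detection relies on.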

Weakly Supervised Open-Vocabulary Object Detection

Spatial Likelihood Voting with Self-Knowledge Distillation for Weakly Supervised Object Detection

SLV: Spatial Likelihood Voting for Weakly Supervised Object Detection

UWSOD: Toward Fully-Supervised-Level Capacity Weakly Supervised Object Detection

HUWSOD: Holistic Self-training for Unified Weakly Supervised Object Detection

An adaptive learning-based weakly supervised object detection via context awareness

Deep Learning for Weakly-Supervised Object Detection and Localization: A Survey

Self-Training-Based Semantic-Balanced Network for Weakly Supervised Object Detection in Remote-Sensing Images

Recurrent Self-Optimizing Proposals for Weakly Supervised Object Detection

A Dual-Network Progressive Approach to Weakly Supervised Object Detection

Enabling Deep Residual Networks for Weakly Supervised Object Detection

PCL: Proposal Cluster Learning for Weakly Supervised Object Detection

Weakly Supervised Object Detection with Symmetry Context

Weakly-semi-supervised object detection in remotely sensed imagery

MOL: Towards Accurate Weakly Supervised Remote Sensing Object Detection Via Multi-view Noisy Learning

Weakly Supervised Object Detection for Remote Sensing Images via Progressive Image-Level and Instance-Level Feature Refinement

Open-Vocabulary Object Detection with an Open Corpus

Open-World Weakly-Supervised Object Localization

High-Quality Proposals for Weakly Supervised Object Detection

W2N: Switching from Weak Supervision to Noisy Supervision for Object Detection

Salvage of Supervision in Weakly Supervised Object Detection