Weakly Supervised Open-Vocabulary Object Detection

Jianghang Lin, Yunhang Shen, Bingquan Wang, Shaohui Lin, Ke Li, Liujuan Cao
2023-12-20
Abstract: Despite weakly supervised object detection (WSOD) being a promising step toward evading strong instance-level annotations, its capability is confined to closed-set categories within a single training dataset. In this paper, we propose a novel weakly supervised open-vocabulary object detection framework, namely WSOVOD, to extend traditional WSOD to detect novel concepts and utilize diverse datasets with only image-level annotations. To achieve this, we explore three vital strategies, including dataset-level feature adaptation, image-level salient object localization, and region-level vision-language alignment. First, we perform data-aware feature extraction to produce an input-conditional coefficient, which is leveraged into dataset attribute prototypes to identify dataset bias and help achieve cross-dataset generalization. Second, a customized location-oriented weakly supervised region proposal network is proposed to utilize high-level semantic layouts from the category-agnostic segment anything model to distinguish object boundaries. Lastly, we introduce a proposal-concept synchronized multiple-instance network, i.e., object mining and refinement with visual-semantic alignment, to discover objects matched to the text embeddings of concepts. Extensive experiments on Pascal VOC and MS COCO demonstrate that the proposed WSOVOD achieves new state-of-the-art compared with previous WSOD methods in both closed-set object localization and detection tasks. Meanwhile, WSOVOD enables cross-dataset and open-vocabulary learning to achieve on-par or even better performance than well-established fully-supervised open-vocabulary object detection (FSOVOD).
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
### Problems Addressed by the Paper

The paper addresses several obstacles to weakly supervised open-vocabulary object detection:

1. **Limitations of Traditional WSOD**:
   - Existing weakly supervised object detection (WSOD) methods handle only the closed set of categories in their training dataset; they cannot detect novel concepts or draw on diverse datasets.
   - These methods typically rely on the small category sets of specific datasets, such as the 20 categories in Pascal VOC or the 80 categories in MS COCO.
2. **Dataset Bias**:
   - Datasets differ in their data distributions, so a model trained on one dataset performs poorly on another.
   - For example, ILSVRC is an object-centric dataset with a balanced category distribution, while LVIS contains many complex scenes and has a long-tailed category distribution.
3. **Object Proposal Generation**:
   - Existing WSOD methods rely on traditional object proposal generators that use only low-level features, which limits the model's ability to learn from different semantic levels.
   - Some methods try to improve proposal generation with pseudo-label supervision, but the noisy training signal makes it difficult to produce high-quality proposals.
4. **Vision-Language Alignment**:
   - Aligning visual and language representations is challenging under weak supervision.
   - Existing open-vocabulary research usually relies on full supervision, requiring both classification embeddings and box annotations for vision-language alignment.

To address these issues, the authors propose a weakly supervised open-vocabulary object detection framework, WSOVOD, built on three main strategies:

1. **Data-Aware Feature Extraction**:
   - Extract input-conditional, data-aware features and combine them with dataset attribute prototypes to identify dataset bias in proposal features drawn from different distributions.
   - Learn global image features through an additional branch that generates a channel-wise global vector; this vector serves as a set of coefficients for recalibrating the final proposal features, which improves generalization across scenes and categories.
2. **Location-Oriented Region Proposal Network**:
   - Use knowledge from the Segment Anything Model (SAM), a category-agnostic segmentation model, to design a location-oriented weakly supervised region proposal network (LO-WSRPN) that identifies potential object boundaries.
   - Supplement the network's proposals with additional proposals generated by SAM, and use SAM's high-level semantic layouts to distinguish object boundaries.
3. **Proposal-Concept Synchronized Multiple-Instance Network**:
   - Introduce a proposal-concept synchronized multiple-instance network that discovers potential objects using image-level classification embeddings and gradually aligns visual and language representations.
   - Use a pre-trained text encoder to obtain text embeddings of the target vocabulary and treat them as category prototypes in multiple-instance learning, further aligning object and concept representations.

With these strategies, WSOVOD sets a new state of the art among WSOD methods on Pascal VOC and MS COCO. In cross-dataset and open-vocabulary settings, it performs on par with, or even better than, fully supervised open-vocabulary object detection methods.
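
To make the third strategy more concrete, the following is a minimal sketch of multiple-instance learning in which text embeddings act as category prototypes. It is not the authors' implementation. It assumes a WSDDN-style two-stream aggregation that shares one set of cosine-similarity logits, a fixed temperature, and toy 2-D vectors standing in for real region features and text-encoder outputs. All function and variable names (`mil_image_scores`, `image_level_bce`, `temperature`) are hypothetical.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def mil_image_scores(proposal_feats, text_embeds, temperature=0.1):
    """Aggregate proposal-to-concept similarities into image-level scores.

    proposal_feats: R region feature vectors (assumed already projected
                    into the text embedding space).
    text_embeds:    C concept embeddings used as category prototypes.
    Returns (image_scores, proposal_scores), where proposal_scores[r][c]
    is the score of proposal r for concept c.
    """
    props = [l2_normalize(p) for p in proposal_feats]
    texts = [l2_normalize(t) for t in text_embeds]
    num_r, num_c = len(props), len(texts)

    # Cosine similarity between every proposal and every concept prototype.
    logits = [[sum(a * b for a, b in zip(p, t)) / temperature for t in texts]
              for p in props]

    # Classification stream: which concept does each proposal look like?
    cls = [softmax(row) for row in logits]
    # Detection stream: which proposal best represents each concept?
    det = [softmax([logits[r][c] for r in range(num_r)]) for c in range(num_c)]

    proposal_scores = [[cls[r][c] * det[c][r] for c in range(num_c)]
                       for r in range(num_r)]
    # Summing over proposals gives one score per concept, each in [0, 1].
    image_scores = [sum(proposal_scores[r][c] for r in range(num_r))
                    for c in range(num_c)]
    return image_scores, proposal_scores


def image_level_bce(image_scores, labels, eps=1e-6):
    """Binary cross-entropy against image-level labels (the only supervision)."""
    total = 0.0
    for score, label in zip(image_scores, labels):
        score = min(max(score, eps), 1.0 - eps)
        total -= label * math.log(score) + (1 - label) * math.log(1.0 - score)
    return total / len(labels)


# Toy example: three proposals, two concepts ("cat", "dog").
proposals = [[1.0, 0.0], [0.9, 0.1], [0.8, 0.2]]
concepts = [[1.0, 0.0], [0.0, 1.0]]  # stand-ins for text-encoder outputs
image_scores, proposal_scores = mil_image_scores(proposals, concepts)
print("image-level scores:", image_scores)
print("loss with label 'cat':", image_level_bce(image_scores, [1, 0]))
```

Because the concept set enters only through `text_embeds`, swapping in embeddings for a different vocabulary changes what the head can score without changing any learned classifier weights. This is the property that makes open-vocabulary inference possible in this kind of design.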