OV-DAR: Open-Vocabulary Object Detection and Attributes Recognition
DOI: https://doi.org/10.1007/s11263-024-02144-1
IF: 13.369
2024-06-14
International Journal of Computer Vision
Abstract: In this paper, we endeavor to localize all potential objects in an image and infer their visual categories, attributes, and shapes, even when certain objects were not covered by the model's supervised training. This is similar to the challenge posed by open-vocabulary object detection and recognition. Compared with previous object detection and recognition frameworks, the proposed OV-DAR framework offers advantages in generalization, universality, and granularity of expression. Specifically, OV-DAR disentangles the open-vocabulary object detection and recognition problem into two components: class-agnostic object proposal and open-vocabulary classification. It employs co-training to maintain a balance between the performance of these two components. For the former, we construct class-agnostic object proposal networks based on anchors/queries with the SAM foundation model, which demonstrates robust generalization in object proposing and masking. For the latter, we merge available object-centered category classification and attribute prediction data, apply co-learning to fine-tune CLIP efficiently, and subsequently augment the open-vocabulary capability on object-centered category/attribute prediction tasks using freely accessible online image–text pairs. To ensure the efficiency and accuracy of open-vocabulary classification, we devise a structure akin to Faster R-CNN and fully exploit the knowledge of object-centered CLIP for end-to-end multi-object open-vocabulary category and attribute prediction via knowledge distillation. We conduct comprehensive experiments on the VAW, MS-COCO, LSA, and OVAD datasets. The results not only illustrate the complementarity of semantic category and attribute recognition for visual scene understanding but also underscore the generalization capability of OV-DAR in localizing, categorizing, attributing, and masking tasks and in open-world scene perception.
computer science, artificial intelligence
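The open-vocabulary classification component described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes region and text embeddings already live in a shared CLIP-style space, and the toy vectors, the function name `classify_region`, the temperature of 0.07, and the attribute threshold of 0.5 are all hypothetical choices made for illustration. Categories are treated as a single-label problem (softmax over cosine similarities), while attributes are treated as multi-label (independent thresholding), since one object can carry several attributes at once.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def classify_region(region_emb, category_embs, attribute_embs,
                    temperature=0.07, attr_threshold=0.5):
    """Score one proposed region against free-form category and attribute names.

    temperature and attr_threshold are illustrative assumptions,
    not values taken from the paper.
    """
    names = list(category_embs)
    # Single-label category prediction: softmax over scaled similarities.
    logits = [cosine(region_emb, category_embs[n]) / temperature for n in names]
    probs = softmax(logits)
    best = max(range(len(names)), key=probs.__getitem__)
    # Multi-label attribute prediction: keep every attribute above the threshold.
    attributes = [name for name, emb in attribute_embs.items()
                  if cosine(region_emb, emb) >= attr_threshold]
    return names[best], probs[best], attributes

# Hypothetical toy embeddings standing in for text-encoder outputs.
category_embs = {"dog": [1.0, 0.0, 0.0], "car": [0.0, 1.0, 0.0]}
attribute_embs = {"furry": [0.9, 0.0, 0.4], "metallic": [0.0, 0.9, 0.4]}

# Hypothetical embedding of one class-agnostic proposal.
region_emb = [0.8, 0.1, 0.3]

label, confidence, attributes = classify_region(
    region_emb, category_embs, attribute_embs)
print(label, attributes)  # dog ['furry']
```

Because the category and attribute vocabularies are passed in as plain dictionaries, new names can be added at inference time without retraining, which is the property that makes the classification step open-vocabulary.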