Abstract: In this paper, we aim to localize all potential objects in an image and infer their visual categories, attributes, and shapes, even when some of those objects were not covered by the model's supervised training. This setting is similar to the challenge posed by open-vocabulary object detection and recognition. Compared with previous object detection and recognition frameworks, the proposed OV-DAR framework offers better generalization, universality, and granularity of expression. Specifically, OV-DAR disentangles the open-vocabulary object detection and recognition problem into two components, class-agnostic object proposal and open-vocabulary classification, and employs co-training to balance the performance of the two. For the former, we construct anchor/query-based class-agnostic object proposal networks with the SAM foundation model, which demonstrates robust generalization in object proposing and masking. For the latter, we merge available object-centered category classification and attribute prediction data, use co-learning to fine-tune CLIP efficiently, and subsequently augment its open-vocabulary capability on object-centered category/attribute prediction tasks using freely accessible online image–text pairs. To ensure the efficiency and accuracy of open-vocabulary classification, we devise a structure akin to Faster R-CNN and use knowledge distillation to fully exploit the knowledge of the object-centered CLIP for end-to-end multi-object open-vocabulary category and attribute prediction. We conduct comprehensive experiments on the VAW, MS-COCO, LSA, and OVAD datasets. The results not only illustrate the complementarity of semantic category and attribute recognition for visual scene understanding but also underscore the generalization capability of OV-DAR in localizing, categorizing, attributing, and masking tasks and in open-world scene perception.
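The open-vocabulary classification component described above can be illustrated with a minimal sketch. This is not the OV-DAR implementation: it only shows the general idea of scoring each class-agnostic region against free-form text labels by cosine similarity in a shared embedding space, as CLIP-style models do. The function names, the temperature value, and the three-dimensional toy vectors are all assumptions made for illustration; in practice the embeddings would come from an image encoder and a text encoder.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine(a, b):
    """Cosine similarity between two vectors."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


def softmax(scores, temperature=0.01):
    """Temperature-scaled softmax (the temperature is an assumed value)."""
    scaled = [s / temperature for s in scores]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def classify_regions(region_embeddings, text_embeddings, vocabulary):
    """Assign each region the vocabulary entry with the highest probability."""
    results = []
    for emb in region_embeddings:
        sims = [cosine(emb, t) for t in text_embeddings]
        probs = softmax(sims)
        best = max(range(len(vocabulary)), key=probs.__getitem__)
        results.append((vocabulary[best], probs[best]))
    return results


# Hypothetical vocabulary mixing category and attribute phrases.
vocabulary = ["dog", "red car", "wooden chair"]
# Toy stand-ins for text-encoder outputs (made up, not real CLIP features).
text_embeddings = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
# Toy stand-ins for features of two class-agnostic region proposals.
region_embeddings = [[0.9, 0.1, 0.0], [0.1, 0.2, 0.95]]

predictions = classify_regions(region_embeddings, text_embeddings, vocabulary)
labels = [label for label, _ in predictions]
print(labels)
```

Because the vocabulary is just a list of strings paired with text embeddings, new categories or attributes can be added at inference time without retraining the region proposer, which is the property the two-component decomposition relies on.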
