Prompt-Guided DETR with RoI-pruned masked attention for open-vocabulary object detection

Hwanjun Song, Jihwan Bang
DOI: https://doi.org/10.1016/j.patcog.2024.110648
IF: 8
2024-06-13
Pattern Recognition
Abstract: Prompt-OVD is an efficient and effective DETR-based framework for open-vocabulary object detection that utilizes class embeddings from CLIP as prompts, guiding the Transformer decoder to detect objects in base and novel classes. Additionally, our RoI-pruned masked attention helps leverage the zero-shot classification ability of the Vision Transformer-based CLIP, resulting in improved detection performance at a minimal computational cost. Our experiments on the OV-COCO and OV-LVIS datasets demonstrate that Prompt-OVD achieves an impressive 21.2 times faster inference speed than the first end-to-end open-vocabulary detection method (OV-DETR), while also achieving higher APs than four two-stage methods operating within similar inference time ranges. We release the code at https://bit.ly/45qnbs4 .
computer science, artificial intelligence; engineering, electrical & electronic
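
The abstract names RoI-pruned masked attention but does not describe its mechanics. As a rough illustration of the general idea only, the sketch below restricts a single attention query to the ViT patch tokens that overlap a region of interest. The function names, the square patch grid, the pixel-coordinate box format, and the single-query formulation are all assumptions made for this example and are not taken from the paper or its released code.

```python
import math


def roi_token_mask(box, grid_size, patch_size):
    """Flag the patches of a grid_size x grid_size grid that overlap a box.

    box is (x0, y0, x1, y1) in pixels (an assumed format); the result is a
    flat, row-major list of booleans, one per patch token.
    """
    x0, y0, x1, y1 = box
    mask = []
    for row in range(grid_size):
        for col in range(grid_size):
            px0, py0 = col * patch_size, row * patch_size
            px1, py1 = px0 + patch_size, py0 + patch_size
            # A patch is kept when its rectangle intersects the box.
            mask.append(px0 < x1 and px1 > x0 and py0 < y1 and py1 > y0)
    return mask


def masked_attention(query, keys, values, mask):
    """Scaled dot-product attention for one query over unmasked tokens only."""
    if not any(mask):
        raise ValueError("mask must keep at least one token")
    scale = 1.0 / math.sqrt(len(query))
    # Masked-out tokens get a score of -inf, so their softmax weight is 0.
    scores = [
        sum(q * k for q, k in zip(query, key)) * scale if keep else -math.inf
        for key, keep in zip(keys, mask)
    ]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(values[0])
    output = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return output, weights


# Hypothetical usage: a 4x4 grid of 16-pixel patches and a box over the
# top-left quarter of a 64x64 image.
mask = roi_token_mask((0, 0, 32, 32), grid_size=4, patch_size=16)
keys = [[1.0, 0.0] for _ in range(16)]
values = [[float(i)] for i in range(16)]
output, weights = masked_attention([1.0, 0.0], keys, values, mask)
```

In this toy setup only patch tokens 0, 1, 4 and 5 receive non-zero attention weight. An implementation aiming at the computational savings the abstract mentions would presumably skip the masked tokens entirely instead of scoring and zeroing them, but that detail is a guess.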