OVLW-DETR: Open-Vocabulary Light-Weighted Detection Transformer

Yu Wang, Xiangbo Su, Qiang Chen, Xinyu Zhang, Teng Xi, Kun Yao, Errui Ding, Gang Zhang, Jingdong Wang
2024-07-15
Abstract: Open-vocabulary object detection focuses on detecting novel categories guided by natural language. In this report, we propose Open-Vocabulary Light-Weighted Detection Transformer (OVLW-DETR), a deployment-friendly open-vocabulary detector with strong performance and low latency. Building upon OVLW-DETR, we provide an end-to-end training recipe that transfers knowledge from a vision-language model (VLM) to an object detector with simple alignment. We align the detector with the text encoder from the VLM by replacing the fixed classification layer weights in the detector with the class-name embeddings extracted from the text encoder. Without an additional fusing module, OVLW-DETR is flexible and deployment-friendly, making it easier to implement and modulate, and improving the efficiency of interleaved attention computation. Experimental results demonstrate that the proposed approach is superior to existing real-time open-vocabulary detectors on the standard Zero-Shot LVIS benchmark. Source code and pre-trained models are available at https://github.com/Atten4Vis/LW-DETR.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper addresses open-vocabulary object detection. Traditional object detection methods are usually limited to a predefined set of categories, which restricts their application in the real world. The paper's goal is to transfer knowledge from a Vision-Language Model (VLM) to an object detector through a simple alignment method, without additional fusion modules, and thereby achieve low-latency, high-performance real-time open-vocabulary detection. OVLW-DETR (Open-Vocabulary Light-Weighted Detection Transformer) does this by replacing the classification layer weights in the detector with class-name embeddings extracted from the VLM text encoder. This approach makes the model more flexible and simplifies deployment, so the model is easier to implement and adjust. Experimental results show that OVLW-DETR outperforms existing real-time open-vocabulary detectors on the standard Zero-Shot LVIS benchmark.
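The alignment idea described above (class-name embeddings standing in for a fixed classification layer) can be illustrated with a minimal sketch. This is not the authors' implementation. The function names, the cosine-similarity scoring, the sigmoid, the `temperature` parameter, and the hand-written three-dimensional embeddings are all assumptions made for illustration. In a real system the embeddings would come from the VLM text encoder and the query features from the detector's decoder.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)


def classify_queries(query_features, class_embeddings, temperature=1.0):
    """Score each query feature against every class-name embedding.

    The class-name embeddings play the role of classification weights:
    each logit is a (temperature-scaled) cosine similarity between a
    query feature and a class-name embedding. This scoring rule is an
    assumption for illustration, not taken from the paper.
    """
    names = list(class_embeddings)
    text = [l2_normalize(class_embeddings[name]) for name in names]
    results = []
    for feature in query_features:
        q = l2_normalize(feature)
        logits = [sum(a * b for a, b in zip(q, t)) / temperature for t in text]
        scores = [1.0 / (1.0 + math.exp(-z)) for z in logits]
        best = max(range(len(names)), key=scores.__getitem__)
        results.append((names[best], scores[best]))
    return results


# Hypothetical embeddings; a real pipeline would obtain these from the
# VLM text encoder applied to each class name.
class_embeddings = {
    "cat": [1.0, 0.0, 0.0],
    "dog": [0.0, 1.0, 0.0],
    "zebra": [0.0, 0.0, 1.0],
}

# Hypothetical detector query features (one per predicted box).
query_features = [[0.9, 0.1, 0.0], [0.0, 0.2, 0.8]]

predictions = classify_queries(query_features, class_embeddings, temperature=0.1)
for name, score in predictions:
    print(name, round(score, 3))
```

Because the vocabulary is just a dictionary of embeddings in this sketch, adding a category at inference time means adding one more entry. No classification weights need to be retrained, which is the flexibility the summary attributes to dropping the fusion module.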