LLMs Meet VLMs: Boost Open Vocabulary Object Detection with Fine-grained Descriptors

Sheng Jin, Xueying Jiang, Jiaxing Huang, Lewei Lu, Shijian Lu
2024-02-07
Abstract: Inspired by the outstanding zero-shot capability of vision language models (VLMs) in image classification tasks, open-vocabulary object detection has attracted increasing interest by distilling the broad VLM knowledge into detector training. However, most existing open-vocabulary detectors learn by aligning region embeddings with categorical labels (e.g., bicycle) only, disregarding the capability of VLMs on aligning visual embeddings with fine-grained text descriptions of object parts (e.g., pedals and bells). This paper presents DVDet, a Descriptor-Enhanced Open Vocabulary Detector that introduces conditional context prompts and hierarchical textual descriptors that enable precise region-text alignment as well as open-vocabulary detection training in general. Specifically, the conditional context prompt transforms regional embeddings into image-like representations that can be directly integrated into general open vocabulary detection training. In addition, we introduce large language models as an interactive and implicit knowledge repository which enables iterative mining and refining visually oriented textual descriptors for precise region-text alignment. Extensive experiments over multiple large-scale benchmarks show that DVDet outperforms the state-of-the-art consistently by large margins.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
### The Problem the Paper Aims to Solve

This paper addresses a key issue in Open Vocabulary Object Detection (OVOD): how to leverage the strengths of Vision Language Models (VLMs) to improve open-vocabulary object detectors. Specifically:

1. **Current Issues**:
   - Most current open-vocabulary object detectors align region embeddings only with category labels (e.g., "bicycle"), ignoring the ability of VLMs to align visual embeddings with fine-grained descriptions of object parts (e.g., "bell" and "pedal").
   - As a result, region embeddings are poorly aligned with fine-grained part descriptions.

2. **Proposed Method**:
   - The paper proposes DVDet (Descriptor-Enhanced Open Vocabulary Detector), which achieves more precise region-text alignment by introducing conditional context prompts and hierarchical textual descriptors.
   - DVDet uses Large Language Models (LLMs) as an interactive, implicit knowledge repository, iteratively mining and refining visually oriented fine-grained descriptors to improve region-text alignment.

3. **Main Contributions**:
   - A feature-level visual prompting technique (the conditional context prompt) transforms regional embeddings into image-like representations, so it can be integrated directly into existing open-vocabulary detectors.
   - A hierarchical updating mechanism refines the descriptors, and with them the region-text alignment, through iterative interaction with LLMs.
   - Extensive experiments over multiple large-scale benchmarks show that the method significantly improves open-vocabulary detection on both base and novel categories.

In summary, this paper aims to improve the overall performance of open-vocabulary object detectors by fully exploiting the ability of VLMs to align visual features with fine-grained descriptions.
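
To make the idea of descriptor-enhanced region-text alignment more concrete, here is a minimal sketch in plain Python. It is not DVDet's implementation: this summary does not give the paper's scoring formula or update schedule, so the weighted average in `descriptor_enhanced_score`, the `weight` and `keep` parameters, and the `propose` callback (a stand-in for querying an LLM and encoding its suggestions with a text encoder) are all assumptions made for illustration.

```python
import math
from typing import Callable, Dict, List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two embeddings (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def descriptor_enhanced_score(
    region: Vector,
    category_emb: Vector,
    descriptor_embs: List[Vector],
    weight: float = 0.5,  # hypothetical mixing weight, not taken from the paper
) -> float:
    """Blend category-label similarity with mean part-descriptor similarity."""
    category_score = cosine(region, category_emb)
    if not descriptor_embs:
        return category_score
    descriptor_score = sum(cosine(region, d) for d in descriptor_embs) / len(
        descriptor_embs
    )
    return (1.0 - weight) * category_score + weight * descriptor_score


def refine_descriptors(
    descriptors: Dict[str, Vector],
    regions: List[Vector],
    propose: Callable[[List[str]], Dict[str, Vector]],
    keep: int,
) -> Dict[str, Vector]:
    """One refinement round: keep the `keep` descriptors that best match the
    given regions and replace the rest with new proposals.

    `propose` is a hypothetical callback standing in for an LLM query plus
    text encoding; it receives the dropped descriptor names.
    """
    if not regions:
        return dict(descriptors)

    def mean_similarity(name: str) -> float:
        return sum(cosine(r, descriptors[name]) for r in regions) / len(regions)

    ranked = sorted(descriptors, key=mean_similarity, reverse=True)
    kept, dropped = ranked[:keep], ranked[keep:]
    refreshed = {name: descriptors[name] for name in kept}
    refreshed.update(propose(dropped))
    return refreshed
```

With toy two-dimensional embeddings, a region that matches the "bicycle" label and one of two descriptors scores between the label-only and descriptor-only similarities, and a refinement round drops the descriptor that matches no region and substitutes whatever `propose` returns. Calling `refine_descriptors` repeatedly mimics, very loosely, the iterative mining and refining loop described above.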