LLMs Meet VLMs: Boost Open Vocabulary Object Detection with Fine-grained Descriptors

Sheng Jin, Xueying Jiang, Jiaxing Huang, Lewei Lu, Shijian Lu

2024-02-07

Abstract: Inspired by the outstanding zero-shot capability of vision language models (VLMs) in image classification tasks, open-vocabulary object detection has attracted increasing interest by distilling the broad VLM knowledge into detector training. However, most existing open-vocabulary detectors learn by aligning region embeddings with categorical labels (e.g., bicycle) only, disregarding the capability of VLMs on aligning visual embeddings with fine-grained text description of object parts (e.g., pedals and bells). This paper presents DVDet, a Descriptor-Enhanced Open Vocabulary Detector that introduces conditional context prompts and hierarchical textual descriptors that enable precise region-text alignment as well as open-vocabulary detection training in general. Specifically, the conditional context prompt transforms regional embeddings into image-like representations that can be directly integrated into general open vocabulary detection training. In addition, we introduce large language models as an interactive and implicit knowledge repository which enables iterative mining and refining visually oriented textual descriptors for precise region-text alignment. Extensive experiments over multiple large-scale benchmarks show that DVDet outperforms the state-of-the-art consistently by large margins.

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

### The Problem the Paper Aims to Solve

This paper addresses a key issue in open-vocabulary object detection (OVOD): how to exploit the fine-grained alignment ability of vision language models (VLMs) to improve open-vocabulary detectors. Specifically:

1. **Current Issues**:
   - Most existing open-vocabulary detectors align region embeddings with category labels only (e.g., "bicycle"), ignoring the ability of VLMs to align visual embeddings with fine-grained descriptions of object parts (e.g., "pedals" and "bells").
   - As a result, region-text alignment is imprecise, particularly at the level of fine-grained descriptions.

2. **Proposed Method**:
   - The paper proposes DVDet (Descriptor-Enhanced Open Vocabulary Detector), which achieves more precise region-text alignment by introducing conditional context prompts and hierarchical textual descriptors.
   - DVDet uses large language models (LLMs) as an interactive, implicit knowledge repository, iteratively mining and refining visually oriented fine-grained descriptors to improve region-text alignment.

3. **Main Contributions**:
   - A feature-level visual prompting technique (the conditional context prompt) transforms region embeddings into image-like representations, so that it can be integrated directly into existing open-vocabulary detectors.
   - A hierarchical updating mechanism refines the textual descriptors, and with them the region-text alignment, through iterative interaction with LLMs.
   - Extensive experiments show that the method significantly improves open-vocabulary detection performance on both base and novel categories.

In summary, the paper aims to improve open-vocabulary object detectors by making full use of the fine-grained description alignment that VLMs already provide.
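To make the idea of descriptor-enhanced region-text alignment concrete, here is a minimal sketch in plain Python. It is not DVDet's actual formulation: the blending rule in `category_score`, the `weight` parameter, and the toy 3-dimensional embeddings are all assumptions chosen for illustration. In a real detector the vectors would come from a VLM's image and text encoders.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def category_score(region, label_emb, descriptor_embs, weight=0.5):
    """Blend label similarity with mean descriptor similarity.

    Hypothetical scoring rule for illustration only; `weight` is an
    assumed hyperparameter, not something taken from the paper.
    """
    label_sim = cosine(region, label_emb)
    if not descriptor_embs:
        return label_sim  # fall back to label-only alignment
    desc_sim = sum(cosine(region, d) for d in descriptor_embs) / len(descriptor_embs)
    return (1 - weight) * label_sim + weight * desc_sim


def classify_region(region, vocabulary):
    """Score a region embedding against every category; return (best, scores)."""
    scores = {
        name: category_score(region, entry["label"], entry["descriptors"])
        for name, entry in vocabulary.items()
    }
    return max(scores, key=scores.get), scores


# Toy embeddings (made up): each category has one label vector and
# a few part-descriptor vectors (e.g. "pedals", "bells" for a bicycle).
vocabulary = {
    "bicycle": {
        "label": [1.0, 0.0, 0.0],
        "descriptors": [[0.8, 0.6, 0.0], [0.6, 0.8, 0.0]],
    },
    "motorcycle": {
        "label": [0.9, 0.0, 0.436],
        "descriptors": [[0.0, 0.0, 1.0]],
    },
}
region = [0.7, 0.7, 0.1]  # made-up region embedding

best, scores = classify_region(region, vocabulary)
print(best, scores)
```

With these toy vectors the two label embeddings are nearly equally similar to the region, and the part descriptors are what separate the categories. That is the intuition behind aligning regions with fine-grained descriptors rather than with category names alone.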


Spatial Likelihood Voting with Self-Knowledge Distillation for Weakly Supervised Object Detection.

LaMI-DETR: Open-Vocabulary Detection with Language Model Instruction

Fine-grained Visual-Text Prompt-Driven Self-Training for Open-Vocabulary Object Detection

Unlocking Textual and Visual Wisdom: Open-Vocabulary 3D Object Detection Enhanced by Comprehensive Guidance from Text and Image

Learning Object-Language Alignments for Open-Vocabulary Object Detection

LOVD: Large-and-Open Vocabulary Object Detection

What Makes Good Open-Vocabulary Detector: A Disassembling Perspective

F-VLM: Open-Vocabulary Object Detection upon Frozen Vision and Language Models

MarvelOVD: Marrying Object Recognition and Vision-Language Models for Robust Open-Vocabulary Object Detection

DenseVLM: A Retrieval and Decoupled Alignment Framework for Open-Vocabulary Dense Prediction

Open-vocabulary Object Detection via Vision and Language Knowledge Distillation

DST-Det: Simple Dynamic Self-Training for Open-Vocabulary Object Detection

OVLW-DETR: Open-Vocabulary Light-Weighted Detection Transformer

P$^3$OVD: Fine-grained Visual-Text Prompt-Driven Self-Training for Open-Vocabulary Object Detection

Aligning Bag of Regions for Open-Vocabulary Object Detection

Exploring the Potential of Large Foundation Models for Open-Vocabulary HOI Detection

From Open Vocabulary to Open World: Teaching Vision Language Models to Detect Novel Objects

HA-FGOVD: Highlighting Fine-grained Attributes via Explicit Linear Composition for Open-Vocabulary Object Detection

Open Vocabulary Object Detection with Proposal Mining and Prediction Equalization

LLMs as Visual Explainers: Advancing Image Classification with Evolving Visual Descriptions