Hyperbolic Learning with Synthetic Captions for Open-World Detection

Fanjie Kong, Yanbei Chen, Jiarui Cai, Davide Modolo

2024-04-08

Abstract: Open-world detection poses significant challenges, as it requires the detection of any object using either object class labels or free-form texts. Existing related works often use large-scale manually annotated caption datasets for training, which are extremely expensive to collect. Instead, we propose to transfer knowledge from vision-language models (VLMs) to enrich the open-vocabulary descriptions automatically. Specifically, we bootstrap dense synthetic captions using pre-trained VLMs to provide rich descriptions of different regions in images, and incorporate these captions to train a novel detector that generalizes to novel concepts. To mitigate the noise caused by hallucination in synthetic captions, we also propose a novel hyperbolic vision-language learning approach to impose a hierarchy between visual and caption embeddings. We call our detector "HyperLearner". We conduct extensive experiments on a wide variety of open-world detection benchmarks (COCO, LVIS, Object Detection in the Wild, RefCOCO) and our results show that our model consistently outperforms existing state-of-the-art methods, such as GLIP, GLIPv2 and Grounding DINO, when using the same backbone.

Computer Vision and Pattern Recognition

What problem does this paper attempt to address?

The paper addresses open-world detection: detecting and localizing any object in an image, whether it is specified by a predefined category label or by a free-form text description. It makes the following contributions:

1. **Enhancing Open Vocabulary Detection with Synthetic Captions**: The paper enriches the model's understanding of open-vocabulary concepts with synthetic captions generated by pre-trained vision-language models. These captions provide dense descriptions of image regions, covering both known and unknown objects, and replace the manually annotated caption datasets that earlier methods rely on.
2. **Introducing Hyperbolic Vision-Language Learning**: Synthetic captions can hallucinate, that is, describe content unrelated to the image. To mitigate this noise, the paper introduces a hyperbolic vision-language learning objective that imposes a hierarchy between visual and caption embeddings, so that the captions stay aligned with the image features.
3. **Experimental Results**: The proposed detector, named "HyperLearner", is evaluated on COCO, LVIS, Object Detection in the Wild (ODinW), and RefCOCO. The paper reports that it consistently outperforms state-of-the-art methods such as GLIP, GLIPv2, and Grounding DINO when using the same backbone.

In summary, the paper aims to improve generalization in open-world detection by combining automatically generated synthetic captions with a vision-language learning method based on hyperbolic space.
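To give some intuition for why hyperbolic space suits hierarchies, here is a minimal sketch of distances in the Poincaré ball, one common model of hyperbolic geometry. This is an illustration only: the abstract does not say which hyperbolic model or loss HyperLearner uses, and the function name `poincare_distance` and the 2-D points below are hypothetical.

```python
import math


def poincare_distance(u, v):
    """Geodesic distance between two points strictly inside the unit Poincare ball."""
    sq_u = sum(x * x for x in u)
    sq_v = sum(x * x for x in v)
    if sq_u >= 1.0 or sq_v >= 1.0:
        raise ValueError("points must lie strictly inside the unit ball")
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    # d(u, v) = arcosh(1 + 2 * |u - v|^2 / ((1 - |u|^2) * (1 - |v|^2)))
    return math.acosh(1.0 + 2.0 * sq_diff / ((1.0 - sq_u) * (1.0 - sq_v)))


# Hypothetical embeddings: both pairs are 0.1 apart in Euclidean terms,
# but one pair sits near the origin and the other near the boundary.
near_origin = poincare_distance([0.0, 0.0], [0.1, 0.0])
near_boundary = poincare_distance([0.85, 0.0], [0.95, 0.0])
print(near_origin, near_boundary)
```

The same Euclidean gap corresponds to a much larger hyperbolic distance near the boundary of the ball. That extra room toward the boundary is why hyperbolic embeddings are often used for tree-like structure, with general concepts placed near the origin and specific ones farther out.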
