Hyperbolic Learning with Synthetic Captions for Open-World Detection

Fanjie Kong, Yanbei Chen, Jiarui Cai, Davide Modolo
2024-04-08
Abstract: Open-world detection poses significant challenges, as it requires the detection of any object using either object class labels or free-form texts. Existing related works often use large-scale manually annotated caption datasets for training, which are extremely expensive to collect. Instead, we propose to transfer knowledge from vision-language models (VLMs) to enrich the open-vocabulary descriptions automatically. Specifically, we bootstrap dense synthetic captions using pre-trained VLMs to provide rich descriptions on different regions in images, and incorporate these captions to train a novel detector that generalizes to novel concepts. To mitigate the noise caused by hallucination in synthetic captions, we also propose a novel hyperbolic vision-language learning approach to impose a hierarchy between visual and caption embeddings. We call our detector "HyperLearner". We conduct extensive experiments on a wide variety of open-world detection benchmarks (COCO, LVIS, Object Detection in the Wild, RefCOCO) and our results show that our model consistently outperforms existing state-of-the-art methods, such as GLIP, GLIPv2 and Grounding DINO, when using the same backbone.
Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The paper primarily addresses the problem of Open-world Detection, which involves detecting and localizing any object in an image, not limited to predefined category labels but also including free-form text descriptions. Specifically, the study makes the following contributions:

1. **Enhancing Open Vocabulary Detection with Synthetic Captions**: The paper proposes a method to enrich the model's understanding of open vocabulary concepts by utilizing synthetic captions generated by pre-trained vision-language models. These synthetic captions can provide rich descriptions of objects in the image, including both known and unknown objects.
2. **Introducing Hyperbolic Vision-Language Learning**: To address the hallucination problem that may arise with synthetic captions, where captions might contain information unrelated to the image, the paper introduces a new hyperbolic vision-language learning objective. This method establishes a structural hierarchy to ensure alignment between synthetic captions and image features, and it is specifically optimized to tackle the hallucination issue in synthetic captions.
3. **Experimental Results**: The paper demonstrates that the proposed model (named "HyperLearner") achieves state-of-the-art performance on multiple open-world detection benchmark datasets, including COCO, LVIS, Object Detection in the Wild (ODiW), and RefCOCO. Compared to existing techniques, HyperLearner shows superior performance under the same backbone network.

In summary, the goal of this paper is to improve the model's generalization ability in open-world detection tasks by leveraging automatically generated synthetic captions and a novel hyperbolic space-based vision-language learning method.
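
To give some intuition for why hyperbolic space suits hierarchies, the sketch below computes the geodesic distance in the Poincaré ball, one common model of hyperbolic space. This is only an illustration under stated assumptions: the summary above does not specify which hyperbolic model or loss HyperLearner uses, and the function name `poincare_distance` is hypothetical rather than taken from the paper.

```python
import math


def poincare_distance(u, v):
    """Geodesic distance between two points strictly inside the unit Poincare ball.

    Illustrative only: not the paper's training objective.
    """
    sq_u = sum(x * x for x in u)
    sq_v = sum(x * x for x in v)
    if sq_u >= 1.0 or sq_v >= 1.0:
        raise ValueError("points must lie strictly inside the unit ball")
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    # d(u, v) = arcosh(1 + 2 * |u - v|^2 / ((1 - |u|^2) * (1 - |v|^2)))
    return math.acosh(1.0 + 2.0 * sq_diff / ((1.0 - sq_u) * (1.0 - sq_v)))


# The same Euclidean gap (0.1) covers more hyperbolic distance near the boundary
# than near the origin, which leaves room for many fine-grained concepts far
# from the origin and a few general ones close to it.
near_origin = poincare_distance((0.0, 0.0), (0.1, 0.0))
near_boundary = poincare_distance((0.8, 0.0), (0.9, 0.0))
```

In this geometry, general concepts can sit near the origin and more specific ones toward the boundary, which is the kind of structure a hierarchy between visual and caption embeddings can exploit.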