Abstract: Visual grounding tasks aim to localize image regions based on natural language references. In this work, we explore whether generative VLMs predominantly trained on image-text data could be leveraged to scale up the text annotation of visual grounding data. We find that grounding knowledge already exists in generative VLMs and can be elicited by proper prompting. We thus prompt a VLM to generate object-level descriptions by feeding it object regions from existing object detection datasets. We further propose attribute modeling to explicitly capture the important object attributes, and spatial relation modeling to capture inter-object relationships, both of which are common linguistic patterns in referring expressions. Our constructed dataset (500K images, 1M objects, 16M referring expressions) is one of the largest grounding datasets to date, and the first grounding dataset with purely model-generated queries and human-annotated objects. To verify the quality of this data, we conduct zero-shot transfer experiments to the popular RefCOCO benchmarks for both referring expression comprehension (REC) and segmentation (RES) tasks. On both tasks, our model significantly outperforms the state-of-the-art approaches without using human-annotated visual grounding data. Our results demonstrate the promise of generative VLMs to scale up visual grounding in the real world. Code and models will be released.
What problem does this paper attempt to address?
### Problems Addressed by the Paper
This paper addresses the limited size of datasets for the Visual Grounding task, whose goal is to locate specific regions in an image based on natural language queries. The task matters for applications such as visual reasoning and human-computer interaction. However, existing Visual Grounding datasets are small in scale and costly to expand because they require detailed manual annotations.
To overcome this challenge, the authors propose a method to automatically generate large-scale Visual Grounding data using Generative Vision-Language Models (VLMs). This method allows for the generation of a large number of high-quality region-text pairs without the need for additional manual annotations, thereby expanding the scale of Visual Grounding datasets.
### Main Contributions
1. **Hypothesis and Empirical Evidence**: The authors hypothesize and empirically demonstrate that generative VLMs pre-trained on image-text pairs can naturally generate high-quality object-level descriptions for individual object regions. By doing so, they use general VLMs to automatically generate referring expression annotations, providing supervision for specialized Visual Grounding models and overcoming the limitations of small-scale grounding datasets.
2. **Enriching Automatic Annotations**: The authors further enrich the automatically generated region-text pairs by leveraging spatial relationship heuristics and cross-domain knowledge of object attributes, providing more comprehensive and diverse query annotations.
3. **Introduction of VLM-VG Dataset**: The authors introduce VLM-VG, a Visual Grounding dataset for scalable referring expression comprehension/segmentation whose queries are generated by a model rather than written by human annotators (the object annotations come from existing detection datasets). By pre-training on VLM-VG, the authors achieve state-of-the-art zero-shot performance on the RefCOCO/+/g benchmarks using a lightweight Faster R-CNN model.
### Method Overview
1. **Generative VLM**: Generative Vision-Language Models generate text through autoregressive or masked prediction. By designing different prompts, these models can handle various tasks without specific task training objectives.
2. **Referring Expression Generation**:
- **Region Description**: Individual object regions are cropped from an image, and a VLM is prompted to generate text describing the central object in each crop.
- **Relationship Modeling**: Using bounding box positional information to generate descriptions of spatial relationships between objects.
- **Attribute Modeling**: Querying VLMs to generate descriptions of object attributes such as color, material, etc.
3. **From Detection to Grounding**: Based on the COCO 2017 and Objects365 v1 detection datasets, the VLM-VG dataset is generated. This dataset contains 512K images, 1.1M objects, and 16.2M referring expressions, with an average of 14.7 referring expressions per object.
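The region-cropping and spatial-relation steps above can be illustrated with a minimal sketch. The summary does not spell out the exact heuristics, so everything here is an assumption made for illustration: the function names `crop_window` and `horizontal_relations` are hypothetical, boxes are assumed to be `(x_min, y_min, x_max, y_max)` in pixels, and the relation rule (ordering same-category objects by box center and labelling them left, middle, or right) is only one plausible heuristic, not necessarily the authors'. The VLM call itself is omitted.

```python
import math
from typing import List, Tuple

# Assumed box layout: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


def crop_window(box: Box, width: int, height: int, margin: int = 0) -> Tuple[int, int, int, int]:
    """Integer crop window around a box, padded by `margin` pixels and clipped to the image."""
    x0, y0, x1, y1 = box
    return (
        max(0, math.floor(x0) - margin),
        max(0, math.floor(y0) - margin),
        min(width, math.ceil(x1) + margin),
        min(height, math.ceil(y1) + margin),
    )


def horizontal_relations(boxes: List[Box], category: str) -> List[str]:
    """One expression per box, ordering same-category boxes by horizontal center.

    Hypothetical heuristic: the leftmost box becomes "<category> on the left",
    the rightmost "<category> on the right", and any others "in the middle".
    A lone object gets only its category name, since no relation is needed.
    """
    if len(boxes) < 2:
        return [category] * len(boxes)
    order = sorted(range(len(boxes)), key=lambda i: (boxes[i][0] + boxes[i][2]) / 2)
    expressions = [f"{category} in the middle"] * len(boxes)
    expressions[order[0]] = f"{category} on the left"
    expressions[order[-1]] = f"{category} on the right"
    return expressions


if __name__ == "__main__":
    dogs = [(300, 0, 400, 100), (0, 0, 100, 100), (150, 0, 250, 100)]
    for box, text in zip(dogs, horizontal_relations(dogs, "dog")):
        print(crop_window(box, width=640, height=480, margin=5), text)
```

In a full pipeline, each crop window would be passed to the VLM for a region description, and relation strings like these would be added as extra queries for the same box.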
### Experimental Results
1. **Referring Expression Comprehension (REC)**: In zero-shot evaluations on the RefCOCO/+/g datasets, the authors' method significantly outperforms existing zero-shot methods, particularly on the RefCOCO subset that requires spatial relationship modeling, with a performance improvement of up to 7.4 percentage points.
2. **Referring Expression Segmentation (RES)**: In zero-shot evaluations on the RefCOCO/+/g datasets, the authors' method achieves near-best performance across all three datasets, with average improvements of 5.1% and 4.1% in oIOU and mIOU metrics, respectively.
3. **Beyond Detection Datasets**: By leveraging generative VLMs, the authors' method adapts to different source datasets and can generate high-quality Visual Grounding data without relying on manually written referring expressions.
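The two segmentation metrics named above differ in how they aggregate. The sketch below assumes oIOU and mIOU carry their usual meaning in the referring segmentation literature: overall IoU divides the summed intersection by the summed union across all samples, while mean IoU averages the per-sample IoU. Masks are represented as flattened 0/1 sequences for simplicity, and the function names are hypothetical.

```python
from typing import Iterable, Sequence, Tuple

Mask = Sequence[int]  # flattened binary mask: 1 = object pixel, 0 = background


def intersection_union(pred: Mask, target: Mask) -> Tuple[int, int]:
    """Pixel counts of the intersection and union of two binary masks."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    inter = sum(1 for p, t in zip(pred, target) if p and t)
    union = sum(1 for p, t in zip(pred, target) if p or t)
    return inter, union


def overall_iou(pairs: Iterable[Tuple[Mask, Mask]]) -> float:
    """oIoU: total intersection over total union, so large objects weigh more."""
    total_inter = 0
    total_union = 0
    for pred, target in pairs:
        inter, union = intersection_union(pred, target)
        total_inter += inter
        total_union += union
    return total_inter / total_union if total_union else 0.0


def mean_iou(pairs: Iterable[Tuple[Mask, Mask]]) -> float:
    """mIoU: average of per-sample IoU, so every sample weighs equally."""
    ious = []
    for pred, target in pairs:
        inter, union = intersection_union(pred, target)
        ious.append(inter / union if union else 0.0)
    return sum(ious) / len(ious) if ious else 0.0


if __name__ == "__main__":
    samples = [
        ([1, 1, 0, 0], [1, 0, 0, 0]),  # small object, half-correct prediction
        ([1, 1, 1, 1], [1, 1, 1, 1]),  # large object, perfect prediction
    ]
    print(overall_iou(samples), mean_iou(samples))
```

Because oIoU pools pixels, it favors models that do well on large objects, whereas mIoU treats each expression equally. This is one reason the two metrics can show different improvement margins for the same model.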
In summary, this paper effectively addresses the issue of limited dataset size in existing Visual Grounding tasks by using generative VLMs to automatically generate large-scale Visual Grounding data, providing new insights and methods for the development of Visual Grounding tasks.