Abstract: Open-vocabulary learning has emerged as an active research area, particularly in light of the widespread adoption of vision-based foundation models. Its primary objective is to comprehend novel concepts that fall outside a predefined vocabulary. One key facet of this effort is Visual Grounding (VG), which entails locating a specific region within an image based on a corresponding language description. While current foundation models excel at various vision-language tasks, there is a noticeable absence of models specifically tailored for open-vocabulary visual grounding. This work introduces two novel and challenging open-vocabulary tasks: Open-Vocabulary Visual Grounding (OV-VG) and Open-Vocabulary Phrase Localization (OV-PL). The overarching aim is to connect language descriptions with the localization of novel objects. To facilitate this, we have curated a comprehensive annotated benchmark comprising 7,272 OV-VG images (10,000 instances) and 1,000 OV-PL images. To address these tasks, we evaluated various baseline methods built on existing open-vocabulary object detection (OV-D), VG, and phrase localization (PL) frameworks. Surprisingly, we found that state-of-the-art (SOTA) methods often falter in diverse scenarios. Consequently, we developed a novel framework that integrates two critical components: Text-Image Query Selection (TIQS) and Language-Guided Feature Attention (LGFA). These modules are designed to bolster the recognition of novel categories and enhance the alignment between visual and linguistic information. Extensive experiments demonstrate the efficacy of the proposed framework, which consistently attains SOTA performance on the OV-VG task. Additionally, ablation studies provide further evidence of the effectiveness of the proposed modules. Code and datasets will be made publicly available at https://github.com/cv516Buaa/OV-VG.

OV-VG: A benchmark for open-vocabulary visual grounding