Abstract: Open-vocabulary learning has emerged as an active research area, particularly in light of the widespread adoption of vision-based foundation models. Its primary objective is to comprehend novel concepts that fall outside a predefined vocabulary. One key facet of this effort is Visual Grounding (VG), which entails locating a specific region within an image based on a corresponding language description. While current foundation models excel at various vision-language tasks, there is a noticeable absence of models specifically tailored for open-vocabulary visual grounding. This work introduces two novel and challenging open-vocabulary tasks, namely Open-Vocabulary Visual Grounding (OV-VG) and Open-Vocabulary Phrase Localization (OV-PL). The overarching aim is to establish connections between language descriptions and the localization of novel objects. To facilitate this, we have curated a comprehensive annotated benchmark, encompassing 7,272 OV-VG images (comprising 10,000 instances) and 1,000 OV-PL images. To address these challenges, we explored various baseline methods rooted in existing open-vocabulary object detection (OV-D), VG, and phrase localization (PL) frameworks. Surprisingly, we found that state-of-the-art (SOTA) methods often falter in diverse scenarios. Consequently, we developed a novel framework that integrates two critical components: Text-Image Query Selection (TIQS) and Language-Guided Feature Attention (LGFA). These modules are designed to bolster the recognition of novel categories and enhance the alignment between visual and linguistic information. Extensive experiments demonstrate the efficacy of our proposed framework, which consistently attains SOTA performance on the OV-VG task. Additionally, ablation studies provide further evidence of the effectiveness of the proposed modules. Code and datasets will be made publicly available at https://github.com/cv516Buaa/OV-VG.

OV-VG: A benchmark for open-vocabulary visual grounding
