Abstract: Open-vocabulary learning has emerged as an active research area, particularly in light of the widespread adoption of vision-based foundation models. Its primary objective is to comprehend novel concepts that fall outside a predefined vocabulary. One key facet of this effort is Visual Grounding (VG), which entails locating a specific region within an image based on a corresponding language description. While current foundation models excel at various vision-language tasks, there is a noticeable absence of models specifically tailored for open-vocabulary visual grounding (OV-VG). This work introduces two novel and challenging open-vocabulary (OV) tasks: Open-Vocabulary Visual Grounding (OV-VG) and Open-Vocabulary Phrase Localization (OV-PL). The overarching aim is to connect language descriptions with the localization of novel objects. To facilitate this, we curate an annotated benchmark comprising 7,272 OV-VG images (10,000 instances) and 1,000 OV-PL images. We evaluate a range of baselines built on existing open-vocabulary object detection (OV-D), VG, and phrase localization (PL) frameworks. Surprisingly, we find that state-of-the-art (SOTA) methods often falter in diverse scenarios. Consequently, we develop a novel framework that integrates two critical components: Text-Image Query Selection (TIQS) and Language-Guided Feature Attention (LGFA). These modules are designed to bolster the recognition of novel categories and enhance the alignment between visual and linguistic information. Extensive experiments demonstrate the efficacy of the proposed framework, which consistently attains SOTA performance on the OV-VG task. Additionally, ablation studies provide further evidence of the effectiveness of the proposed modules. Code and datasets will be made publicly available at https://github.com/cv516Buaa/OV-VG.
