Abstract: Open-vocabulary learning has emerged as a cutting-edge research area, particularly in light of the widespread adoption of vision-based foundation models. Its primary objective is to comprehend novel concepts that fall outside a predefined vocabulary. One key facet of this endeavor is Visual Grounding (VG), which entails locating a specific region within an image based on a corresponding language description. While current foundation models excel at various vision-language tasks, there is a noticeable absence of models specifically tailored for open-vocabulary visual grounding (OV-VG). This work introduces two novel and challenging open-vocabulary tasks: Open-Vocabulary Visual Grounding (OV-VG) and Open-Vocabulary Phrase Localization (OV-PL). The overarching aim is to establish connections between language descriptions and the localization of novel objects. To facilitate this, we have curated a comprehensive annotated benchmark, encompassing 7,272 OV-VG images (comprising 10,000 instances) and 1,000 OV-PL images. To address these challenges, we evaluated various baselines built on existing open-vocabulary object detection (OV-D), VG, and phrase localization (PL) frameworks. Surprisingly, we found that state-of-the-art (SOTA) methods often falter in diverse scenarios. Consequently, we developed a novel framework that integrates two critical components: Text-Image Query Selection (TIQS) and Language-Guided Feature Attention (LGFA). These modules are designed to bolster the recognition of novel categories and enhance the alignment between visual and linguistic information. Extensive experiments demonstrate the efficacy of our proposed framework, which consistently attains SOTA performance on the OV-VG task. Additionally, ablation studies provide further evidence of the effectiveness of the proposed modules. Code and datasets will be made publicly available at https://github.com/cv516Buaa/OV-VG.

Textual Grounding for Open-vocabulary Visual Information Extraction in Layout-diversified Documents

VSR: A Unified Framework for Document Layout Analysis Combining Vision, Semantics and Relations

Visual-Semantic Graph Matching for Visual Grounding

Layout-Aware Information Extraction for Document-Grounded Dialogue: Dataset, Method and Demonstration

OV-VG: A benchmark for open-vocabulary visual grounding

A LayoutLMv3-Based Model for Enhanced Relation Extraction in Visually-Rich Documents

Robust Layout-aware IE for Visually Rich Documents with Pre-trained Language Models

GeoLayoutLM: Geometric Pre-training for Visual Information Extraction

Visual Information Extraction in the Wild: Practical Dataset and End-to-end Solution

ERNIE-Layout: Layout Knowledge Enhanced Pre-training for Visually-rich Document Understanding

A Bounding Box is Worth One Token: Interleaving Layout and Text in a Large Language Model for Document Understanding

LayoutReader: Pre-training of Text and Layout for Reading Order Detection

ReLayout: Towards Real-World Document Understanding via Layout-enhanced Pre-training

LayoutLM: Pre-training of Text and Layout for Document Image Understanding

TRIE++: Towards End-to-End Information Extraction from Visually Rich Documents

LayoutXLM: Multimodal Pre-training for Multilingual Visually-rich Document Understanding

End-to-end Visual Grounding Based on Query Text Guidance and Multi-stage Reasoning

LayoutLLM: Layout Instruction Tuning with Large Language Models for Document Understanding

Modeling Layout Reading Order as Ordering Relations for Visually-rich Document Understanding

Towards Robust Visual Information Extraction in Real World: New Dataset and Novel Solution

ResVG: Enhancing Relation and Semantic Understanding in Multiple Instances for Visual Grounding