Advancing Visual Grounding with Scene Knowledge: Benchmark and Method

Zhihong Chen, Ruifei Zhang, Yibing Song, Xiang Wan, Guanbin Li

2023-07-21

Abstract: Visual grounding (VG) aims to establish fine-grained alignment between vision and language. Ideally, it can be a testbed for vision-and-language models to evaluate their understanding of the images and texts and their reasoning abilities over their joint space. However, most existing VG datasets are constructed using simple description texts, which do not require sufficient reasoning over the images and texts. This has been demonstrated in a recent study \cite{luo2022goes}, where a simple LSTM-based text encoder without pretraining can achieve state-of-the-art performance on mainstream VG datasets. Therefore, in this paper, we propose a novel benchmark of Scene Knowledge-guided Visual Grounding (SK-VG), where the image content and referring expressions are not sufficient to ground the target objects, forcing the models to have a reasoning ability on the long-form scene knowledge. To perform this task, we propose two approaches to accept the triple-type input, where the former embeds knowledge into the image features before the image-query interaction; the latter leverages linguistic structure to assist in computing the image-text matching. We conduct extensive experiments to analyze the above methods and show that the proposed approaches achieve promising results but still leave room for improvement, including performance and interpretability. The dataset and code are available at https://github.com/zhjohnchan/SK-VG.

Computer Vision and Pattern Recognition, Computation and Language

What problem does this paper attempt to address?

This paper addresses the limited reasoning that existing Visual Grounding (VG) tasks demand of models. Most existing VG datasets are built from simple descriptive texts that require little reasoning over images and text. As a result, even a simple LSTM text encoder without pre-training can achieve state-of-the-art performance on mainstream VG datasets. The authors therefore argue that existing VG datasets cannot adequately evaluate a model's reasoning and cross-modal understanding abilities.

To address this, the authors propose a new benchmark, Scene Knowledge-guided Visual Grounding (SK-VG). In this benchmark, the image content and referring expressions alone are not sufficient to locate the target object, so the model must reason over long-form scene knowledge. The SK-VG dataset contains approximately 40,000 referring expressions and 8,000 scene stories from 4,000 pictures, with each picture having 2 scene stories and each story having 5 referring expressions.

The authors also propose two methods for this task:

1. **Knowledge-embedded Vision-Language Interaction (KeViLI)**: This method first embeds scene knowledge into the image features and then performs image-query interaction.
2. **Linguistic-enhanced Vision-Language Matching (LeViLM)**: This method first extracts image features and text features and then uses structured linguistic information to assist in computing the match between image regions and text entities.

Through extensive experiments, the authors demonstrate the effectiveness of both methods, but also point out that there is still room for improvement, especially on complex and difficult cases.
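To make the triple-type input and the dataset arithmetic concrete, here is a minimal Python sketch. It is not the authors' code: the `SKVGSample` field names, the `(x1, y1, x2, y2)` box format, and the use of cosine similarity for region-entity matching are all assumptions made for illustration, and the released dataset may use a different schema.

```python
from dataclasses import dataclass
from math import sqrt
from typing import List, Sequence, Tuple


@dataclass
class SKVGSample:
    """One hypothetical SK-VG example (field names are assumptions)."""
    image_id: str
    scene_knowledge: str        # long-form scene story
    referring_expression: str   # query that depends on the story
    target_box: Tuple[float, float, float, float]  # assumed (x1, y1, x2, y2)


def dataset_size(num_images: int,
                 stories_per_image: int,
                 expressions_per_story: int) -> Tuple[int, int]:
    """Return (number of stories, number of referring expressions)."""
    stories = num_images * stories_per_image
    expressions = stories * expressions_per_story
    return stories, expressions


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def best_region(region_feats: List[Sequence[float]],
                entity_feat: Sequence[float]) -> int:
    """Index of the region whose feature is most similar to the entity feature.

    A toy stand-in for region-entity matching; the paper's actual
    matching function is not specified in this summary.
    """
    scores = [cosine(r, entity_feat) for r in region_feats]
    return max(range(len(scores)), key=scores.__getitem__)


if __name__ == "__main__":
    stories, expressions = dataset_size(4000, 2, 5)
    print(stories, expressions)
    print(best_region([[1.0, 0.0], [0.0, 1.0]], [0.1, 0.9]))
```

With the round figures quoted above, 4,000 pictures with 2 stories each and 5 expressions per story give 8,000 stories and 40,000 expressions, which matches the approximate totals reported for the dataset.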
