Advancing Visual Grounding with Scene Knowledge: Benchmark and Method

Zhihong Chen, Ruifei Zhang, Yibing Song, Xiang Wan, Guanbin Li
2023-07-21
Abstract: Visual grounding (VG) aims to establish fine-grained alignment between vision and language. Ideally, it can be a testbed for vision-and-language models to evaluate their understanding of the images and texts and their reasoning abilities over their joint space. However, most existing VG datasets are constructed using simple description texts, which do not require sufficient reasoning over the images and texts. This has been demonstrated in a recent study \cite{luo2022goes}, where a simple LSTM-based text encoder without pretraining can achieve state-of-the-art performance on mainstream VG datasets. Therefore, in this paper, we propose a novel benchmark of Scene Knowledge-guided Visual Grounding (SK-VG), where the image content and referring expressions are not sufficient to ground the target objects, forcing the models to have a reasoning ability on the long-form scene knowledge. To perform this task, we propose two approaches to accept the triple-type input, where the former embeds knowledge into the image features before the image-query interaction; the latter leverages linguistic structure to assist in computing the image-text matching. We conduct extensive experiments to analyze the above methods and show that the proposed approaches achieve promising results but still leave room for improvement, including performance and interpretability. The dataset and code are available at https://github.com/zhjohnchan/SK-VG.
Computer Vision and Pattern Recognition, Computation and Language
What problem does this paper attempt to address?
This paper addresses a weakness of existing Visual Grounding (VG) benchmarks: they do not adequately test how well models understand images and texts. Most existing VG datasets are built from simple descriptive texts that require little reasoning over the image and the text. As a result, even a simple LSTM text encoder without pre-training can achieve state-of-the-art performance on mainstream VG datasets. The authors therefore argue that existing VG datasets cannot properly evaluate a model's reasoning ability or its cross-modal understanding.

To address this, the authors propose a new benchmark, Scene Knowledge-guided Visual Grounding (SK-VG). In this benchmark, the image content and the referring expression alone are not sufficient to locate the target object, so the model must reason over long-form scene knowledge. The SK-VG dataset contains approximately 40,000 referring expressions and 8,000 scene stories for 4,000 images, with 2 scene stories per image and 5 referring expressions per story.

The authors also propose two methods for this task:

1. **Knowledge-embedded Vision-Language Interaction (KeViLI)**: This method first embeds scene knowledge into the image features and then performs the image-query interaction.
2. **Linguistic-enhanced Vision-Language Matching (LeViLM)**: This method first extracts image features and text features and then uses structured linguistic information to assist in computing the match between image regions and text entities.

Extensive experiments show that both methods are effective, but also that there is still room for improvement, especially on complex and difficult cases.
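The triple-type input (image, scene knowledge, referring expression) and the dataset composition described above can be illustrated with a minimal Python sketch. All class names, field names, and the bounding-box convention here are hypothetical and chosen only for illustration; the files released in the SK-VG repository may be organized differently.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple

# Assumed convention for illustration only: (x, y, width, height).
BoundingBox = Tuple[float, float, float, float]


@dataclass
class ReferringExpression:
    query: str               # expression referring to one object in the image
    target_box: BoundingBox  # ground-truth region the model should locate


@dataclass
class SceneStory:
    knowledge: str  # long-form scene knowledge about the image
    expressions: List[ReferringExpression] = field(default_factory=list)


@dataclass
class AnnotatedImage:
    image_path: str
    stories: List[SceneStory] = field(default_factory=list)


def iter_samples(
    images: List[AnnotatedImage],
) -> Iterator[Tuple[str, str, str, BoundingBox]]:
    """Yield one (image, knowledge, query, target box) sample per referring expression."""
    for image in images:
        for story in image.stories:
            for expr in story.expressions:
                yield image.image_path, story.knowledge, expr.query, expr.target_box


def expected_counts(
    num_images: int, stories_per_image: int = 2, expressions_per_story: int = 5
) -> Tuple[int, int]:
    """Return (number of stories, number of referring expressions)."""
    num_stories = num_images * stories_per_image
    num_expressions = num_stories * expressions_per_story
    return num_stories, num_expressions


# Tiny synthetic example: one image, 2 stories, 5 expressions per story.
toy_image = AnnotatedImage(
    image_path="toy.jpg",
    stories=[
        SceneStory(
            knowledge=f"story {s}",
            expressions=[
                ReferringExpression(query=f"query {s}-{e}", target_box=(0.0, 0.0, 1.0, 1.0))
                for e in range(5)
            ],
        )
        for s in range(2)
    ],
)
toy_samples = list(iter_samples([toy_image]))
```

With the figures quoted above, `expected_counts(4000)` gives 8,000 stories and 40,000 referring expressions, which matches the approximate dataset size. Each yielded sample carries the scene knowledge alongside the image and the query, which reflects the point that the query and image alone are not meant to be enough to identify the target.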