Editorial Paper for Pattern Recognition Letters VSI on Cross Model Understanding for Visual Question Answering

Shaohua Wan, Zan Gao, Hanwang Zhang, Chang Xiaojun, Chen, Anastasios Tefas
DOI: https://doi.org/10.1016/j.patrec.2022.06.001
IF: 4.757
2022-01-01
Pattern Recognition Letters
Abstract: Referring expression grounding, which aims to locate the target region in an image described by a natural language expression, plays a fundamental role in vision-language understanding. The task requires understanding high-level semantic correlations between objects in the image according to the referring expression, and therefore inherently requires reasoning over context information, i.e., appearance context and relationship context. However, most existing approaches either neglect the appearance details of the target region or rely on a manually designed reasoning structure and treat the context information of each neighboring object equally, which makes them inflexible when referring expressions are complicated. In this paper, we put forward the Multi-context Reasoning Network (MCRN) for the referring expression grounding task, which performs appearance context reasoning and relationship context reasoning simultaneously. For appearance context reasoning, we propose a local node attention that obtains a local representation of the target object and focuses more closely on its appearance details. For relationship context reasoning, we treat the problem as language-guided multi-step reasoning and design a multi-step graph reasoning module that iteratively captures the intra-context and inter-context between the target region and its intra-class and inter-class neighboring objects, which makes the reasoning process more reliable and interpretable. Extensive experimental results on three popular benchmark datasets demonstrate the superiority of our method.
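The language-guided multi-step graph reasoning described in the abstract can be illustrated with a small sketch. This is not the MCRN implementation: the abstract does not specify the attention form or the update rule, so the dot-product attention, the 0.5/0.5 state update, and the names `reasoning_step` and `ground` are all assumptions made for illustration, and the sketch does not distinguish intra-class from inter-class neighbors.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def reasoning_step(states, neighbors, query):
    """One round of message passing, with neighbor weights guided by the query.

    Assumption: attention is a softmax over dot(query, neighbor_state), and
    the new state is the average of the old state and the attended message.
    """
    new_states = []
    for i, state in enumerate(states):
        nbrs = neighbors[i]
        if not nbrs:
            # A node without neighbors keeps its state unchanged.
            new_states.append(list(state))
            continue
        weights = softmax([dot(query, states[j]) for j in nbrs])
        message = [
            sum(w * states[j][k] for w, j in zip(weights, nbrs))
            for k in range(len(state))
        ]
        new_states.append([0.5 * s + 0.5 * m for s, m in zip(state, message)])
    return new_states


def ground(features, neighbors, query, steps=2):
    """Run `steps` reasoning rounds, then score every region against the query.

    Returns the index of the highest-scoring region and all region scores.
    """
    states = [list(f) for f in features]
    for _ in range(steps):
        states = reasoning_step(states, neighbors, query)
    scores = softmax([dot(query, s) for s in states])
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores


# Hypothetical toy input: three region features, a fully connected graph,
# and a query vector standing in for the encoded referring expression.
features = [[2.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
neighbors = [[1, 2], [0, 2], [0, 1]]
query = [1.0, 0.0]
best, scores = ground(features, neighbors, query, steps=2)
```

Because each neighbor is weighted by its relevance to the query, neighboring objects are not treated equally, which is the property the abstract contrasts with earlier approaches.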