Language-guided Visual Attention Network for Visual Grounding

Haibo Yao, Lipeng Wang, Chengtao Cai, Wei Wang, Zhi Zhang, Lichao Jiang
DOI: https://doi.org/10.1109/m2vip62491.2024.10746112
2024-01-01
Abstract: Visual grounding (VG) is the task of identifying and localizing a specific region of an image from a referring expression that describes it. Existing approaches to VG fall into three main types: two-stage methods, one-stage methods, and Transformer-based methods, all of which have achieved high performance. However, most of these methods do not fully exploit the visual and linguistic information, which limits model performance. In this work, we propose a language-guided visual attention network for visual grounding, which uses language to explore visual information in depth by better modeling the relationship between vision and language. Specifically, we use BERT, a pre-trained language model, to extract word-level and sentence-level linguistic features, giving the model a more comprehensive representation of the referring expression. Inspired by the Transformer architecture, we design a stacked visual attention module that uses language to direct visual attention. In addition, we discuss several ways of fusing visual and linguistic features, so that the fused visual-linguistic information yields the correct box coordinates. Comprehensive evaluations on the ReferItGame benchmark dataset show that the proposed model establishes a new performance standard.
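The idea of letting language direct visual attention can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes plain scaled dot-product cross-attention in which visual tokens act as queries and word-level features act as keys and values, with a residual update and no learned projections. The function names (`language_guided_attention`, `stacked_attention`) and the toy features are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def language_guided_attention(visual, words):
    """One cross-attention layer (assumed form, not taken from the paper).

    visual: list of N visual tokens, each a list of d floats (queries)
    words:  list of M word features, each a list of d floats (keys/values)
    Returns the updated visual tokens and the N x M attention weights.
    """
    d = len(words[0])
    scale = math.sqrt(d)
    updated, weights = [], []
    for v in visual:
        # Similarity between this visual token and every word feature.
        scores = [sum(a * b for a, b in zip(v, w)) / scale for w in words]
        attn = softmax(scores)
        # Language context: attention-weighted sum of the word features.
        context = [sum(a * w[k] for a, w in zip(attn, words)) for k in range(d)]
        # Residual update keeps the original visual information.
        updated.append([x + c for x, c in zip(v, context)])
        weights.append(attn)
    return updated, weights


def stacked_attention(visual, words, num_layers=2):
    """Apply the layer several times; returns final tokens and last weights."""
    weights = []
    for _ in range(num_layers):
        visual, weights = language_guided_attention(visual, words)
    return visual, weights


# Toy usage: two visual tokens and two word features in a 2-d space.
tokens, attn = stacked_attention(
    [[1.0, 0.0], [0.0, 1.0]],
    [[2.0, 0.0], [0.0, 2.0]],
    num_layers=2,
)
```

In this toy case each visual token attends most strongly to the word feature it is aligned with. A real model would add learned query, key, and value projections, multiple heads, and normalization, none of which are specified in the abstract.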