Abstract: Visual grounding (VG) is the task of locating a specific region in an image according to a natural language expression. Existing approaches to VG fall into two-stage, one-stage and Transformer-based methods, all of which have achieved high performance. However, most previous methods extract visual information at a single spatial scale and ignore visual information at other scales, so they cannot fully utilize the visual information. Moreover, insufficient use of linguistic information, especially the failure to capture global linguistic information, may prevent these models from fully understanding language expressions, thus limiting their performance. To better address the task, we propose a language conditioned multi-scale visual attention network (LMSVA) for visual grounding, which makes fuller use of visual and linguistic information to perform multimodal reasoning, thus improving model performance. Specifically, we design a visual feature extractor that expands the original backbone with a multi-scale layer to obtain the required multi-scale visual features. We also pool the output of the pre-trained Bidirectional Encoder Representations from Transformers (BERT) model to extract sentence-level linguistic features, which enables the model to capture global linguistic information. Inspired by the Transformer architecture, we present the Visual Attention Layer guided by Language and Multi-Scale Visual Features (VALMS), which better learns the visual context under the guidance of multi-scale visual and linguistic features and facilitates further multimodal reasoning. Extensive experiments on four large benchmark datasets, ReferItGame, RefCOCO, RefCOCO+ and RefCOCOg, demonstrate that our proposed model achieves state-of-the-art performance.
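
The two ideas named in the abstract, pooling token-level language features into one sentence-level vector and letting visual tokens attend over multi-scale visual features together with that vector, can be illustrated with a minimal NumPy sketch. This is not the authors' VALMS layer: the abstract does not specify the pooling type or the attention formulation, so masked mean pooling and a single scaled dot-product attention step without learned projections are assumptions made here, and all function names, shapes and the random inputs are hypothetical.

```python
import numpy as np


def masked_mean_pool(token_feats, mask):
    """Average token features over real (non-padding) tokens.

    token_feats: (T, D) array, e.g. the last hidden states of a text encoder.
    mask:        (T,) array with 1 for real tokens and 0 for padding.
    Assumption: mean pooling; the abstract only says the output is "pooled".
    """
    mask = mask.astype(token_feats.dtype)[:, None]
    return (token_feats * mask).sum(axis=0) / np.maximum(mask.sum(), 1.0)


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def language_guided_attention(query_feats, multi_scale_feats, sentence_feat):
    """Scaled dot-product attention of visual queries over a joint context.

    query_feats:       (N, D) visual tokens to be refined.
    multi_scale_feats: list of (M_s, D) arrays, one per spatial scale.
    sentence_feat:     (D,) pooled sentence-level feature.
    Simplification: no learned query/key/value projections, single head.
    """
    # Keys/values: all visual scales plus the sentence vector as one extra token.
    context = np.concatenate(list(multi_scale_feats) + [sentence_feat[None, :]], axis=0)
    scores = query_feats @ context.T / np.sqrt(query_feats.shape[-1])
    weights = softmax(scores, axis=-1)
    return weights @ context, weights


# Hypothetical inputs: 6 text tokens (last 2 are padding), feature size 8,
# and two visual scales with 16 and 4 spatial positions.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(6, 8))
mask = np.array([1, 1, 1, 1, 0, 0])
sentence = masked_mean_pool(tokens, mask)

scales = [rng.normal(size=(16, 8)), rng.normal(size=(4, 8))]
refined, attn = language_guided_attention(scales[0], scales, sentence)
```

Here `refined` has one row per query token, and each row of `attn` is a distribution over the 16 + 4 visual positions plus the single sentence token. A real model would add learned projections, multiple heads and positional information.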

LGVC: Language-Guided Visual Context Modeling for 3D Visual Grounding

Beyond Sight: Towards Cognitive Alignment in LVLM via Enriched Visual Knowledge

Exploiting Contextual Objects and Relations for 3D Visual Grounding

Towards CLIP-driven Language-free 3D Visual Grounding Via 2D-3D Relational Enhancement and Consistency

Visual Programming for Zero-shot Open-Vocabulary 3D Visual Grounding

Language Conditioned Multi-Scale Visual Attention Networks for Visual Grounding

Language-guided Visual Attention Network for Visual Grounding

LLM-Optic: Unveiling the Capabilities of Large Language Models for Universal Visual Grounding

LLM-Grounder: Open-Vocabulary 3D Visual Grounding with Large Language Model as an Agent

Scene-LLM: Extending Language Model for 3D Visual Understanding and Reasoning

G$^3$-LQ: Marrying Hyperbolic Alignment with Explicit Semantic-Geometric Modeling for 3D Visual Grounding

Naturally Supervised 3D Visual Grounding with Language-Regularized Concept Learners

Lexicon-Level Contrastive Visual-Grounding Improves Language Modeling

ViewInfer3D: 3D Visual Grounding Based on Embodied Viewpoint Inference

GeoVLN: Learning Geometry-Enhanced Visual Representation with Slot Attention for Vision-and-Language Navigation

Progressive Language-Customized Visual Feature Learning for One-Stage Visual Grounding

Language-Guided Diffusion Model for Visual Grounding

Grounded 3D-LLM with Referent Tokens

LLaVA-Grounding: Grounded Visual Chat with Large Multimodal Models

ViGoR: Improving Visual Grounding of Large Vision Language Models with Fine-Grained Reward Modeling

LSceneLLM: Enhancing Large 3D Scene Understanding Using Adaptive Visual Preferences