LGVC: language-guided visual context modeling for 3D visual grounding

Liang Geng, Jianqin Yin, Yingchun Niu
DOI: https://doi.org/10.1007/s00521-024-09764-1
2024-01-01
Neural Computing and Applications
Abstract: 3D visual grounding is crucial for understanding cross-modal scenes, linking visual objects to their corresponding language descriptions. Traditional methods often use fixed attention patterns in visual encoders, limiting the utility of language-guided attention mechanisms. To address this, we introduce a novel language-guided visual context modeling (LGVC) strategy. Our approach enriches the visual encoding at multiple levels through language knowledge: (1) a Language-Object Embedding (LOE) Module directs attention toward language-relevant proposals in 3D visual scenes, and (2) a Language-Relation Embedding (LRE) Module explores the relationships among objects in the context of accompanying text. Extensive experiments show that LGVC efficiently filters out language-irrelevant proposals and aligns multimodal entities, outperforming state-of-the-art methods.
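The idea of directing attention toward language-relevant proposals can be illustrated with a generic scaled dot-product attention over proposal features. This is a minimal sketch and not the authors' LOE module: the function names, the toy 2-D features, and the use of a single sentence embedding as the query are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def language_guided_weights(proposals, sentence):
    """Hypothetical helper: score each proposal feature against a sentence
    embedding with scaled dot-product similarity, then normalize."""
    scale = math.sqrt(len(sentence))
    return softmax([dot(p, sentence) / scale for p in proposals])


def language_guided_context(proposals, sentence):
    """Hypothetical helper: attention-weighted sum of proposal features,
    so language-relevant proposals dominate the pooled context."""
    weights = language_guided_weights(proposals, sentence)
    dim = len(sentence)
    return [sum(w * p[i] for w, p in zip(weights, proposals)) for i in range(dim)]


# Toy data (assumed): three object proposals and one sentence embedding.
proposals = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
sentence = [1.0, 0.0]

weights = language_guided_weights(proposals, sentence)
context = language_guided_context(proposals, sentence)
```

In this toy setup the first proposal is most aligned with the sentence embedding and receives the largest weight, while the second, orthogonal proposal receives the smallest. A real system would use learned projections and multi-head attention rather than raw dot products.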