ViewInfer3D: 3D Visual Grounding Based on Embodied Viewpoint Inference

Liang Geng, Jianqin Yin
DOI: https://doi.org/10.1109/lra.2024.3426286
IF: 5.2
2024-01-01
IEEE Robotics and Automation Letters
Abstract: 3D Visual Grounding (3D VG) is a fundamental task in embodied intelligence in which a robot interprets a natural language description to locate an object within a 3D environment. The task is difficult because the spatial relationships a robot perceives among objects depend on its observational viewpoint. In this work, we propose ViewInfer3D, a framework that leverages Large Language Models (LLMs) to infer embodied viewpoints, thereby avoiding incorrect observational viewpoints. To enhance the reliability and speed of reasoning from embodied viewpoints, we design three sub-strategies: constructing a hierarchical 3D scene graph, parsing embodied viewpoints, and reasoning over the scene graph. Extensive experiments demonstrate that embodied viewpoint reasoning improves performance on 3D Visual Grounding tasks. Our framework achieves the best performance among all zero-shot methods on the ScanRefer and Nr3D/Sr3D datasets, without significantly increasing inference time.