Abstract: We propose Reasoning to Ground (R2G), a neural-symbolic model that grounds target objects within 3D scenes through explicit reasoning. In contrast to prior work, R2G explicitly models the 3D scene with a semantic concept-based scene graph and recurrently simulates attention transfer across object entities, which makes the process of grounding the target object with the highest probability interpretable. Specifically, we embed multiple object properties in the graph nodes and spatial relations among entities in the edges, using a predefined semantic vocabulary. To guide attention transfer, we employ learning-based or prompting-based methods to analyze the referential utterance and convert it into reasoning instructions within the same semantic space. In each reasoning round, R2G either (1) merges the current attention distribution with the similarity between the instruction and the embedded entity properties or (2) shifts the attention across the scene graph based on the similarity between the instruction and the embedded spatial relations. Experiments on the Sr3D/Nr3D benchmarks show that R2G achieves results comparable to prior work while offering improved interpretability, opening a new path for 3D language grounding.
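
The two kinds of reasoning round described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the tiny vocabularies, the one-hot embeddings, the dot-product similarity, the elementwise-product merge, and the names `filter_step`, `relate_step`, and `ground` are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Dot-product similarity (an assumed choice of similarity measure)."""
    return sum(x * y for x, y in zip(a, b))


def normalize(weights: Sequence[float]) -> List[float]:
    """Rescale non-negative weights to sum to 1; fall back to uniform if all are 0."""
    total = sum(weights)
    if total <= 0:
        return [1.0 / len(weights)] * len(weights)
    return [w / total for w in weights]


def filter_step(attention: Sequence[float], node_props: Sequence[Vector],
                instruction: Vector) -> List[float]:
    """Round type (1): merge attention with instruction/node-property similarity.
    The merge is an elementwise product here, which is an assumption."""
    return normalize([a * dot(instruction, p) for a, p in zip(attention, node_props)])


def relate_step(attention: Sequence[float], edge_rels: Sequence[Sequence[Vector]],
                instruction: Vector) -> List[float]:
    """Round type (2): shift attention from node i to node j in proportion to the
    similarity between the instruction and the relation embedded on edge (i, j)."""
    n = len(attention)
    shifted = [
        sum(attention[i] * dot(instruction, edge_rels[i][j]) for i in range(n))
        for j in range(n)
    ]
    return normalize(shifted)


def ground(node_props: Sequence[Vector], edge_rels: Sequence[Sequence[Vector]],
           instructions: Sequence[Tuple[str, Vector]]) -> List[float]:
    """Run the reasoning rounds in order, starting from uniform attention."""
    attention = normalize([1.0] * len(node_props))
    for kind, vec in instructions:
        if kind == "filter":
            attention = filter_step(attention, node_props, vec)
        elif kind == "relate":
            attention = relate_step(attention, edge_rels, vec)
        else:
            raise ValueError(f"unknown instruction kind: {kind}")
    return attention


# Hypothetical toy scene: object 0 is a table, objects 1 and 2 are chairs.
# Property vocabulary: [table, chair]; relation vocabulary: [near, far].
TABLE, CHAIR = [1.0, 0.0], [0.0, 1.0]
NEAR, FAR, NONE = [1.0, 0.0], [0.0, 1.0], [0.0, 0.0]

node_props = [TABLE, CHAIR, CHAIR]
edge_rels = [
    [NONE, NEAR, FAR],   # relations from the table to each object
    [NEAR, NONE, FAR],   # relations from chair 1
    [FAR, FAR, NONE],    # relations from chair 2
]

# "The chair near the table": find the table, hop along "near", keep chairs.
instructions = [("filter", TABLE), ("relate", NEAR), ("filter", CHAIR)]
attention = ground(node_props, edge_rels, instructions)
target = max(range(len(attention)), key=attention.__getitem__)
```

In this toy scene the attention ends up concentrated on object 1, the chair near the table. Because every round yields an explicit distribution over objects, the intermediate attention values can be inspected step by step, which is the kind of interpretability the abstract refers to.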

What problem does this paper attempt to address?

The paper addresses 3D Visual Grounding (3D-VG): given a 3D scene and a referential language description, localize the target object through reasoning. It proposes a new neural-symbolic model named R2G (Reasoning to Ground), which differs from previous methods in the following ways:

1. **Explicit Modeling**: R2G represents the 3D scene as a scene graph based on semantic concepts, embedding multiple object attributes in the nodes and spatial relationships between entities in the edges.
2. **Interpretable Reasoning**: The language description is parsed into reasoning instructions, which are used to shift attention across the scene graph. The target object is localized step by step, making the entire process more transparent and easier to understand.
3. **Attribute-Related Description Handling**: R2G handles not only descriptions based on spatial relationships but also complex descriptions that include attribute information.

Experimental results on the Sr3D/Nr3D benchmarks show that R2G achieves performance comparable to existing methods, with improvements in interpretability and generalization ability. The paper also reports that R2G handles natural language descriptions well, performing better than other models on simple natural language descriptions.

R2G: Reasoning to Ground in 3D Scenes