Abstract: Our goal is to develop models that allow a robot to efficiently understand or “ground” natural language instructions in the context of its world representation. Contemporary approaches estimate correspondences between language instructions and possible groundings such as objects, regions, and goals for actions that the robot should execute. However, these approaches typically reason in relatively small domains and do not model abstract spatial concepts such as “rows,” “columns,” or “groups” of objects and, hence, are unable to interpret an instruction such as “pick up the middle block in the row of five blocks.” In this paper, we introduce two new models for efficient natural language understanding of robot instructions. The first model, which we call the adaptive distributed correspondence graph (ADCG), is a probabilistic model for interpreting abstract concepts that require hierarchical reasoning over constituent concrete entities as well as notions of cardinality and ordinality. Abstract grounding variables form a Markov boundary over concrete groundings, effectively de-correlating them from the remaining variables in the graph. This structure reduces the complexity of model training and inference. Inference in the model is posed as an approximate search procedure that orders factor computation such that the estimated probable concrete groundings focus the search for abstract concepts towards likely hypotheses, pruning away improbable portions of the exponentially large space of abstractions. Further, we address the issue of scalability to complex domains and introduce a hierarchical extension to a second model termed the hierarchical adaptive distributed correspondence graph (HADCG). The model utilizes the abstractions in the ADCG but infers a coarse symbolic structure from the utterance and the environment model and then performs fine-grained inference over the reduced graphical model, further improving the efficiency of inference.
Empirical evaluation demonstrates accurate grounding of abstract concepts embedded in complex natural language instructions commanding a robotic torso and a mobile robot. Further, the proposed approximate inference method allows significant efficiency gains compared with the baseline, with minimal trade-off in accuracy.
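
The ordering idea described in the abstract, where probable concrete groundings narrow the search over abstract concepts, can be illustrated with a small toy sketch. This is not the ADCG model or its inference procedure: the world layout, the `concrete_scores` stand-in for learned factors, the geometric definition of a "row," and the threshold value are all assumptions made purely for illustration of the example instruction “pick up the middle block in the row of five blocks.”

```python
from itertools import combinations

# Hypothetical world model (not from the paper): object id -> (type, x, y).
WORLD = {
    "b1": ("block", 0.0, 0.0),
    "b2": ("block", 1.0, 0.0),
    "b3": ("block", 2.0, 0.0),
    "b4": ("block", 3.0, 0.0),
    "b5": ("block", 4.0, 0.0),
    "b6": ("block", 1.5, 3.0),
    "c1": ("cup", 2.0, 5.0),
}


def concrete_scores(world, noun):
    """Toy stand-in for concrete grounding factors: high score on a type match."""
    return {oid: (0.9 if kind == noun else 0.1) for oid, (kind, _, _) in world.items()}


def is_row(world, ids, tol=1e-6):
    """Assumed definition of a 'row': members share one y and are evenly spaced in x."""
    pts = sorted((world[i][1], world[i][2]) for i in ids)
    if any(abs(y - pts[0][1]) > tol for _, y in pts):
        return False
    gaps = [b[0] - a[0] for a, b in zip(pts, pts[1:])]
    return all(abs(g - gaps[0]) <= tol for g in gaps)


def ground_middle_of_row(world, noun, cardinality, threshold=0.5):
    """Return (id of the middle object, number of group hypotheses checked)."""
    scores = concrete_scores(world, noun)
    # Step 1: keep only the probable concrete groundings.
    candidates = sorted(oid for oid, s in scores.items() if s >= threshold)
    # Step 2: enumerate abstract 'row' hypotheses over the pruned candidates only.
    hypotheses = list(combinations(candidates, cardinality))
    rows = [g for g in hypotheses if is_row(world, g)]
    if not rows:
        return None, len(hypotheses)
    # Step 3: resolve ordinality ("middle") by ordering the row along x.
    ordered = sorted(rows[0], key=lambda oid: world[oid][1])
    return ordered[len(ordered) // 2], len(hypotheses)


if __name__ == "__main__":
    print(ground_middle_of_row(WORLD, "block", 5))
```

In this toy world the pruned search checks 6 five-object groups, whereas enumerating groups over all seven objects would require 21; the gap grows combinatorially with the number of objects, which is the intuition behind letting concrete groundings focus the abstract search.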

Beyond Bare Queries: Open-Vocabulary Object Grounding with 3D Scene Graph

Free-form Description Guided 3D Visual Graph Network for Object Grounding in Point Cloud

Language Grounding with 3D Objects

Towards CLIP-driven Language-free 3D Visual Grounding Via 2D-3D Relational Enhancement and Consistency

Point2Graph: An End-to-end Point Cloud-based 3D Open-Vocabulary Scene Graph for Robot Navigation

Four Ways to Improve Verbo-visual Fusion for Dense 3D Visual Grounding

Reasoning3D -- Grounding and Reasoning in 3D: Fine-Grained Zero-Shot Open-Vocabulary 3D Reasoning Part Segmentation via Large Vision-Language Models

Clio: Real-time Task-Driven Open-Set 3D Scene Graphs

SeeGround: See and Ground for Zero-Shot Open-Vocabulary 3D Visual Grounding

A Bottom-up Framework for Construction of Structured Semantic 3D Scene Graph

LLM-Grounder: Open-Vocabulary 3D Visual Grounding with Large Language Model as an Agent

ScanReason: Empowering 3D Visual Grounding with Reasoning Capabilities

Hierarchical Open-Vocabulary 3D Scene Graphs for Language-Grounded Robot Navigation

Exploiting Contextual Objects and Relations for 3D Visual Grounding

Efficient grounding of abstract spatial concepts for natural language interaction with robot platforms

Object2Scene: Putting Objects in Context for Open-Vocabulary 3D Detection

Situational Awareness Matters in 3D Vision Language Reasoning

3D Feature Distillation with Object-Centric Priors

Language-Grounded Dynamic Scene Graphs for Interactive Object Search with Mobile Manipulation

Lowis3D: Language-Driven Open-World Instance-Level 3D Scene Understanding

Open-Vocabulary Octree-Graph for 3D Scene Understanding