Abstract: In this paper we present a neurosymbolic architecture for coupling language-guided visual reasoning with robot manipulation. A non-expert human user can prompt the robot using unconstrained natural language, providing a referring expression (REF), a question (VQA), or a grasp action instruction. The system tackles all cases in a task-agnostic fashion using a shared library of primitive skills. Each primitive handles an independent sub-task, such as reasoning about visual attributes, spatial relation comprehension, logic and enumeration, or arm control. A language parser maps the input query to an executable program composed of such primitives, depending on the context. While some primitives are purely symbolic operations (e.g. counting), others are trainable neural functions (e.g. visual grounding), thereby marrying the interpretability and systematic generalization benefits of discrete symbolic approaches with the scalability and representational power of deep networks. We generate a 3D vision-and-language synthetic dataset of tabletop scenes in a simulation environment to train our approach and perform extensive evaluations in both synthetic and real-world scenes. Results showcase the benefits of our approach in terms of accuracy, sample-efficiency, and robustness to the user's vocabulary, while remaining transferable to real-world scenes with few-shot visual fine-tuning. Finally, we integrate our method with a robot framework and demonstrate how it can serve as an interpretable solution for an interactive object-picking task, both in simulation and with a real robot. We make our datasets available at https://gtziafas.github.io/neurosymbolic-manipulation.
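
The idea of executing a program composed of primitives from a shared library can be illustrated with a minimal sketch. This is not the authors' implementation: the primitive names (`filter_attr`, `count`, `unique`, `grasp`), the `SceneObject` fields, and the program format are all hypothetical. The programs are written by hand rather than produced by a language parser, and the neural grounding primitives are replaced by plain attribute lookups over a symbolic scene.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Sequence, Tuple


@dataclass(frozen=True)
class SceneObject:
    """Hypothetical symbolic stand-in for one detected tabletop object."""
    name: str
    color: str
    x: float  # assumed horizontal position on the table


def filter_attr(objs: List[SceneObject], key: str, value: str) -> List[SceneObject]:
    # Stand-in for a trainable visual-grounding primitive: here a plain lookup.
    return [o for o in objs if getattr(o, key) == value]


def count(objs: List[SceneObject]) -> int:
    # Purely symbolic primitive (enumeration).
    return len(objs)


def unique(objs: List[SceneObject]) -> SceneObject:
    # Symbolic primitive: a referring expression must resolve to one object.
    if len(objs) != 1:
        raise ValueError(f"expected exactly one object, got {len(objs)}")
    return objs[0]


def grasp(obj: SceneObject) -> Tuple[str, str, float]:
    # Placeholder for arm control: returns a command instead of moving a robot.
    return ("grasp", obj.name, obj.x)


# The shared skill library: every task type draws from the same primitives.
PRIMITIVES: Dict[str, Callable[..., Any]] = {
    "filter_attr": filter_attr,
    "count": count,
    "unique": unique,
    "grasp": grasp,
}


def run_program(program: Sequence[tuple], scene: List[SceneObject]) -> Any:
    """Run the steps in order, feeding each step's output into the next."""
    value: Any = list(scene)
    for name, *args in program:
        value = PRIMITIVES[name](value, *args)
    return value


demo_scene = [
    SceneObject("mug", "red", 0.1),
    SceneObject("mug", "blue", 0.4),
    SceneObject("bowl", "red", 0.7),
]

# "How many red objects are there?" (VQA)
vqa_program = [("filter_attr", "color", "red"), ("count",)]
# "The blue mug" (REF)
ref_program = [("filter_attr", "name", "mug"), ("filter_attr", "color", "blue"), ("unique",)]
# "Pick up the bowl" (grasp instruction)
grasp_program = [("filter_attr", "name", "bowl"), ("unique",), ("grasp",)]

vqa_answer = run_program(vqa_program, demo_scene)
ref_answer = run_program(ref_program, demo_scene)
grasp_answer = run_program(grasp_program, demo_scene)
```

In this sketch the three query types differ only in which primitives the program chains together, which is the sense in which a shared library can serve REF, VQA, and grasp instructions alike. Each intermediate value can also be inspected, which is one way to read the interpretability claim.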

Grounding Language for Robotic Manipulation via Skill Library

Learning Robot Manipulation Skills from Human Demonstration Videos Using Two-Stream 2-D/3-D Residual Networks with Self-Attention

GSC: A Graph-Based Skill Composition Framework for Robot Learning

Grounding Object Relations in Language-Conditioned Robotic Manipulation with Semantic-Spatial Reasoning

Grounding Language with Visual Affordances over Unstructured Data

Self-driven Grounding: Large Language Model Agents with Automatical Language-aligned Skill Learning

Programmatically Grounded, Compositionally Generalizable Robotic Manipulation

Agentic Skill Discovery

Decision-Making in Robotic Grasping with Large Language Models

Collaborative Language Grounding Toward Situated Human‐Robot Dialogue

Grounding Language to Autonomously-Acquired Skills via Goal Generation

Learning Manipulation Skills through Robot Chain-of-Thought with Sparse Failure Guidance

Grounding Robot Policies with Visuomotor Language Guidance

Bridging Language and Action: A Survey of Language-Conditioned Robot Manipulation

Ground4Act: Leveraging Visual-Language Model for Collaborative Pushing and Grasping in Clutter

Learning Generalizable 3D Manipulation With 10 Demonstrations

STEER: Flexible Robotic Manipulation via Dense Language Grounding

Grounding Language Models in Autonomous Loco-manipulation Tasks

Enhancing Interpretability and Interactivity in Robot Manipulation: A Neurosymbolic Approach

Lifelong Robot Library Learning: Bootstrapping Composable and Generalizable Skills for Embodied Control with Language Models

Audio-Visual Grounding Referring Expression for Robotic Manipulation