Abstract: In this paper we present a neurosymbolic architecture for coupling language-guided visual reasoning with robot manipulation. A non-expert human user can prompt the robot using unconstrained natural language, providing a referring expression (REF), a question (VQA), or a grasp action instruction. The system tackles all cases in a task-agnostic fashion using a shared library of primitive skills. Each primitive handles an independent sub-task, such as reasoning about visual attributes, spatial relation comprehension, logic and enumeration, or arm control. A language parser maps the input query to an executable program composed of such primitives, depending on the context. While some primitives are purely symbolic operations (e.g. counting), others are trainable neural functions (e.g. visual grounding), thereby combining the interpretability and systematic generalization benefits of discrete symbolic approaches with the scalability and representational power of deep networks. We generate a 3D vision-and-language synthetic dataset of tabletop scenes in a simulation environment to train our approach and perform extensive evaluations in both synthetic and real-world scenes. Results showcase the benefits of our approach in terms of accuracy, sample-efficiency, and robustness to the user's vocabulary, while being transferable to real-world scenes with few-shot visual fine-tuning. Finally, we integrate our method with a robot framework and demonstrate how it can serve as an interpretable solution for an interactive object-picking task, both in simulation and with a real robot. We make our datasets available at https://gtziafas.github.io/neurosymbolic-manipulation.
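To make the idea of "an executable program composed of primitives" concrete, the following is a minimal sketch and not the authors' implementation. It assumes a fully symbolic toy scene, so the primitives that the abstract describes as trainable neural functions (e.g. visual grounding) are replaced here by plain attribute lookups. All names (`Obj`, `PRIMITIVES`, `execute`, the primitive names, and the convention that a smaller `x` means "further left") are hypothetical and chosen only for illustration. The programs are written by hand, standing in for what a language parser might emit.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Obj:
    """Toy symbolic object; a real system would hold visual features instead."""
    name: str
    color: str
    x: float  # assumed convention: smaller x means further left on the table


Scene = List[Obj]
Step = Tuple[str, Optional[str]]  # (primitive name, optional argument)


def _unique(objs: Scene) -> Obj:
    # Helper: a referring expression must resolve to exactly one object.
    if len(objs) != 1:
        raise ValueError(f"expected exactly one object, got {len(objs)}")
    return objs[0]


# Each primitive takes (current state, argument, full scene).
def filter_color(objs: Scene, arg: str, scene: Scene) -> Scene:
    # Stand-in for a neural attribute-grounding primitive.
    return [o for o in objs if o.color == arg]


def filter_name(objs: Scene, arg: str, scene: Scene) -> Scene:
    # Stand-in for a neural category-grounding primitive.
    return [o for o in objs if o.name == arg]


def relate_left(objs: Scene, arg: None, scene: Scene) -> Scene:
    # Spatial relation: all scene objects left of the single reference object.
    ref = _unique(objs)
    return [o for o in scene if o.x < ref.x]


def count(objs: Scene, arg: None, scene: Scene) -> int:
    # Purely symbolic enumeration primitive.
    return len(objs)


def grasp(objs: Scene, arg: None, scene: Scene) -> str:
    # Stand-in for arm control: returns a command string instead of moving a robot.
    target = _unique(objs)
    return f"grasp({target.color} {target.name})"


PRIMITIVES: Dict[str, Callable[..., Any]] = {
    "filter_color": filter_color,
    "filter_name": filter_name,
    "relate_left": relate_left,
    "count": count,
    "grasp": grasp,
}


def execute(program: List[Step], scene: Scene) -> Any:
    """Run the primitives in order, starting from the full scene."""
    state: Any = list(scene)
    for op, arg in program:
        state = PRIMITIVES[op](state, arg, scene)
    return state


TABLE: Scene = [
    Obj("mug", "red", 0.1),
    Obj("can", "blue", 0.4),
    Obj("mug", "blue", 0.7),
    Obj("box", "red", 0.9),
]

# "How many blue objects are there?" (VQA)
VQA_PROGRAM: List[Step] = [("filter_color", "blue"), ("count", None)]
# "The red object left of the can" (REF)
REF_PROGRAM: List[Step] = [("filter_name", "can"), ("relate_left", None), ("filter_color", "red")]
# "Pick up the blue mug" (grasp instruction)
GRASP_PROGRAM: List[Step] = [("filter_name", "mug"), ("filter_color", "blue"), ("grasp", None)]

if __name__ == "__main__":
    for program in (VQA_PROGRAM, REF_PROGRAM, GRASP_PROGRAM):
        print(program, "->", execute(program, TABLE))
```

In this sketch the three query types share one primitive library and one executor, and differ only in the final step of the program, which is one way to read the "task-agnostic" claim above. Every intermediate state can be inspected, which is where the interpretability benefit would come from.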

Grounding Spatio-Semantic Referring Expressions for Human-Robot Interaction

GVGNet: Gaze-Directed Visual Grounding for Learning Under-Specified Object Referring Intention

Grounding Language for Robotic Manipulation via Skill Library

Grounding Object Relations in Language-Conditioned Robotic Manipulation with Semantic-Spatial Reasoning

Object Referring in Visual Scene with Spoken Language

Gaze-assisted visual grounding via knowledge distillation for referred object grasping with under-specified object referring

Robot Manipulation in Salient Vision through Referring Image Segmentation and Geometric Constraints

Perspective-Corrected Spatial Referring Expression Generation for Human-Robot Interaction

Grounding Language in Multi-Perspective Referential Communication

Language-guided Semantic Mapping and Mobile Manipulation in Partially Observable Environments

Audio-Visual Grounding Referring Expression for Robotic Manipulation

Towards Understanding Language through Perception in Situated Human-Robot Interaction: From Word Grounding to Grammar Induction

Semantic Grounding for Long-Term Autonomy of Mobile Robots Towards Dynamic Object Search in Home Environments

Enhancing Interpretability and Interactivity in Robot Manipulation: A Neurosymbolic Approach

OCID-Ref: A 3D Robotic Dataset with Embodied Language for Clutter Scene Grounding

Grounding Dynamic Spatial Relations for Embodied (Robot) Interaction

Interactively Picking Real-World Objects with Unconstrained Spoken Language Instructions

REVERIE: Remote Embodied Visual Referring Expression in Real Indoor Environments

A Universal Semantic-Geometric Representation for Robotic Manipulation

Compositional Zero-Shot Learning for Attribute-Based Object Reference in Human-Robot Interaction

ScanERU: Interactive 3D Visual Grounding based on Embodied Reference Understanding