Abstract: Vision-Language Models (VLMs) play a crucial role in robotic manipulation by enabling robots to understand and interpret the visual properties of objects and their surroundings, and to manipulate objects based on this multimodal understanding. Understanding object attributes and spatial relationships is non-trivial, yet it is critical for robotic manipulation. In this work, we present a new dataset focused on spatial relationships and attribute assignment, together with a novel method that uses VLMs to perform object manipulation from task-oriented, high-level input. In the dataset, the spatial relationships between objects are manually described in captions. Each object is additionally labeled with multiple attributes, such as fragility, mass, material, and transparency, derived from a fine-tuned vision-language model. The object information embedded in the captions is automatically extracted and transformed into a data structure (a tree, for demonstration purposes) that captures the spatial relationships among the objects in each image. The tree structures, along with the object attributes, are then fed into a language model, which transforms them into a new tree structure specifying how the objects should be organized to accomplish a specific high-level task. We demonstrate that our method not only improves the comprehension of spatial relationships among objects in the visual environment but also enables robots to interact with these objects more effectively. As a result, this approach significantly enhances spatial reasoning in robotic manipulation tasks. To our knowledge, this is the first method of its kind in the literature, offering a novel solution that allows robots to organize and use objects in their surroundings more efficiently.
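
As a rough illustration of the tree representation described in the abstract, the following Python sketch builds a spatial-relationship tree from relation triples and serializes it, with object attributes, into text that could be passed to a language model. It is a minimal sketch under stated assumptions, not the authors' implementation: the `SceneNode` class, the `build_tree` and `to_prompt` functions, the triple format, and the example objects and attribute values are all hypothetical and do not come from the paper or its dataset.

```python
from dataclasses import dataclass, field


@dataclass
class SceneNode:
    """One object in the scene (hypothetical structure, for illustration only)."""
    name: str
    attributes: dict = field(default_factory=dict)  # e.g. fragility, mass, material
    relation: str = ""                               # spatial relation to the parent node
    children: list = field(default_factory=list)


def build_tree(triples, attributes, root_name="table"):
    """Build a tree from (child, relation, parent) triples.

    The triples stand in for relations extracted from a caption; the
    extraction step itself is not modeled here.
    """
    nodes = {root_name: SceneNode(root_name, attributes.get(root_name, {}))}
    for child, relation, parent in triples:
        for name in (child, parent):
            if name not in nodes:
                nodes[name] = SceneNode(name, attributes.get(name, {}))
        nodes[child].relation = relation
        nodes[parent].children.append(nodes[child])
    return nodes[root_name]


def to_prompt(node, depth=0):
    """Serialize the tree as indented text, one object per line."""
    attrs = ", ".join(f"{k}={v}" for k, v in sorted(node.attributes.items()))
    label = f"{node.relation} -> " if node.relation else ""
    line = "  " * depth + f"{label}{node.name} [{attrs}]"
    return "\n".join([line] + [to_prompt(c, depth + 1) for c in node.children])


# Made-up example scene: a glass on a plate on a table, with a book beside it.
triples = [
    ("plate", "on", "table"),
    ("glass", "on", "plate"),
    ("book", "on", "table"),
]
attributes = {
    "glass": {"fragility": "high", "material": "glass", "transparency": "transparent"},
    "plate": {"fragility": "medium", "material": "ceramic"},
    "book": {"fragility": "low", "material": "paper"},
}

scene = build_tree(triples, attributes)
prompt = to_prompt(scene)
print(prompt)
```

In a pipeline like the one the abstract outlines, text of this kind would be combined with a high-level task description and given to a language model, which would return a reorganized tree. That step is omitted here because the abstract does not specify the prompt or the model interface.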

Bridging Visual Perception with Contextual Semantics for Understanding Robot Manipulation Tasks

Bridging Low-level Geometry to High-level Concepts in Visual Servoing of Robot Manipulation Task Using Event Knowledge Graphs and Vision-Language Models

Understanding Contexts Inside Robot and Human Manipulation Tasks through a Vision-Language Model and Ontology System in a Video Stream

Grounding Language for Robotic Manipulation via Skill Library

Semantic Representation of Robot Manipulation with Knowledge Graph

Hierarchical Understanding in Robotic Manipulation: A Knowledge-Based Framework

A User Interface for Sense-making of the Reasoning Process while Interacting with Robots

Robot Manipulation in Salient Vision through Referring Image Segmentation and Geometric Constraints

Bridging the Robot Perception Gap with Mid-Level Vision

A Robotic Manipulation Framework for Industrial Human–robot Collaboration Based on Continual Knowledge Graph Embedding

Long-term Robot Manipulation Task Planning with Scene Graph and Semantic Knowledge

Enhancing Interpretability and Interactivity in Robot Manipulation: A Neurosymbolic Approach

Transferring the Semantic Constraints in Human Manipulation Behaviors to Robots

Semantically Safe Robot Manipulation: From Semantic Scene Understanding to Motion Safeguards

Learning Robotic Manipulation through Visual Planning and Acting

Grounding Object Relations in Language-Conditioned Robotic Manipulation with Semantic-Spatial Reasoning

Look Closer: Bridging Egocentric and Third-Person Views with Transformers for Robotic Manipulation

Smart Perception for Situation Awareness in Robotic Manipulation Tasks

Dynamic Open-Vocabulary 3D Scene Graphs for Long-term Language-Guided Mobile Manipulation

Task-oriented Robotic Manipulation with Vision Language Models