Abstract: Vision-Language Models (VLMs) play a crucial role in robotic manipulation by enabling robots to understand and interpret the visual properties of objects and their surroundings, allowing them to perform manipulation based on this multimodal understanding. However, accurately understanding object attributes and spatial relationships, although critical to robotic manipulation, is a non-trivial task. In this work, we present a new dataset focused on spatial relationships and attribute assignment, together with a novel method that uses VLMs to perform object manipulation from task-oriented, high-level input. In this dataset, the spatial relationships between objects are manually described as captions. Additionally, each object is labeled with multiple attributes, such as fragility, mass, material, and transparency, derived from a fine-tuned vision-language model. The object information embedded in the captions is automatically extracted and transformed into a data structure (here a tree, for demonstration purposes) that captures the spatial relationships among the objects within each image. The tree structures, along with the object attributes, are then fed into a language model, which transforms them into a new tree structure that determines how the objects should be organized to accomplish a specific high-level task. We demonstrate that our method not only improves the comprehension of spatial relationships among objects in the visual environment but also enables robots to interact with these objects more effectively. As a result, this approach significantly enhances spatial reasoning in robotic manipulation tasks. To our knowledge, this is the first method of its kind in the literature, offering a novel solution that allows robots to more efficiently organize and utilize objects in their surroundings.
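To make the caption-to-tree step concrete, the following is a minimal Python sketch of how spatial captions and per-object attributes could be combined into a tree. It is not the authors' implementation: the caption pattern ("the X is on/in/under the Y"), the `ObjectNode` class, the `build_scene_tree` and `render` helpers, and the choice of a table as the root are all assumptions made for illustration. The abstract does not specify the caption grammar or the tree layout.

```python
import re
from dataclasses import dataclass, field

# Hypothetical caption grammar; the real captions are free-form human text.
CAPTION_RE = re.compile(r"^the (\w+) is (on|in|under) the (\w+)$", re.IGNORECASE)


@dataclass
class ObjectNode:
    """One object in the scene, with its attributes and spatial children."""
    name: str
    attributes: dict = field(default_factory=dict)
    # Each child is stored as (relation, node), e.g. ("on", glass_node).
    children: list = field(default_factory=list)


def build_scene_tree(captions, attributes, root_name="table"):
    """Build a tree rooted at `root_name` from simple spatial captions.

    `attributes` maps object names to attribute dicts (assumed to come
    from a separate attribute-labeling model).
    """
    nodes = {root_name: ObjectNode(root_name, attributes.get(root_name, {}))}
    for caption in captions:
        match = CAPTION_RE.match(caption.strip().rstrip("."))
        if match is None:
            continue  # Skip captions this toy grammar cannot parse.
        child, relation, parent = (g.lower() for g in match.groups())
        for name in (child, parent):
            if name not in nodes:
                nodes[name] = ObjectNode(name, attributes.get(name, {}))
        nodes[parent].children.append((relation, nodes[child]))
    return nodes[root_name]


def render(node, depth=0, relation=None):
    """Return an indented text view of the tree, one line per object."""
    prefix = "  " * depth + (f"[{relation}] " if relation else "")
    lines = [f"{prefix}{node.name} {node.attributes}"]
    for child_relation, child in node.children:
        lines.extend(render(child, depth + 1, child_relation))
    return lines


captions = ["The plate is on the table.", "The glass is on the plate."]
attributes = {"glass": {"fragility": "high", "transparency": "transparent"}}
tree = build_scene_tree(captions, attributes)
print("\n".join(render(tree)))
```

In a pipeline like the one described, a text rendering of this kind, together with the task description, could be placed in a language model prompt so that the model proposes a reorganized tree. That prompting step is not shown here because the abstract gives no details about it.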
