Object-centric Inference for Language Conditioned Placement: A Foundation Model based Approach

Zhixuan Xu, Kechun Xu, Yue Wang, Rong Xiong

2023-04-06

Abstract: We focus on the task of language-conditioned object placement, in which a robot should generate placements that satisfy all the spatial relational constraints in language instructions. Previous works based on rule-based language parsing or scene-centric visual representation have restrictions on the form of instructions and reference objects or require large amounts of training data. We propose an object-centric framework that leverages foundation models to ground the reference objects and spatial relations for placement, which is more sample efficient and generalizable. Experiments indicate that our model can achieve a 97.75% success rate of placement with only ~0.26M trainable parameters. Besides, our method generalizes better to both unseen objects and instructions. Moreover, with only 25% of the training data, we still outperform the top competing approach.

Robotics, Artificial Intelligence, Machine Learning

What problem does this paper attempt to address?

### Problems Addressed by the Paper

The paper focuses on language-conditioned object placement: a robot must place an object at a location that satisfies every spatial relational constraint in the user's natural language instruction. Specifically, the paper covers the following points:

1. **Limitations of Existing Methods**:
   - Methods based on rule-based language parsing or scene-centric visual representations restrict the form of instructions and reference objects, or require large amounts of training data.
   - Previous frameworks struggle to adapt to new objects and instructions in an open-world setting.
2. **Proposed New Method**:
   - An object-centric framework is proposed that uses foundation models to ground reference objects and spatial relations, improving sample efficiency and generalization.
   - Pre-trained large language models (such as GPT-3) and vision-language models (such as CLIP) are used to parse and encode instructions, which allows more flexible instructions to be handled.

With these improvements, the method achieves a high placement success rate (97.75% according to the abstract) with a small amount of training data and generalizes well to unseen objects and instructions. Experimental results show that it outperforms competing methods in multiple scenarios.
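To make the task definition concrete, the following is a minimal Python sketch of what "a placement that satisfies all spatial relational constraints" means. It is not the paper's method: the paper grounds reference objects and relations with foundation models, whereas this sketch assumes the instruction has already been parsed into `(relation, reference object)` pairs and uses hand-written geometric predicates as a stand-in. The coordinate convention, the relation names, and all function and variable names are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

# (x, y) position in a top-down tabletop frame (hypothetical convention:
# x grows to the right, y grows away from the viewer).
Point = Tuple[float, float]

# Hand-written predicates standing in for learned relation grounding.
# This is an illustrative assumption, not how the paper grounds relations.
RELATIONS: Dict[str, Callable[[Point, Point], bool]] = {
    "left of": lambda p, ref: p[0] < ref[0],
    "right of": lambda p, ref: p[0] > ref[0],
    "behind": lambda p, ref: p[1] > ref[1],
    "in front of": lambda p, ref: p[1] < ref[1],
}


def satisfies_all(
    placement: Point,
    constraints: List[Tuple[str, str]],
    objects: Dict[str, Point],
) -> bool:
    """Return True only if the placement meets every (relation, reference) pair."""
    return all(RELATIONS[rel](placement, objects[ref]) for rel, ref in constraints)


def valid_placements(
    candidates: List[Point],
    constraints: List[Tuple[str, str]],
    objects: Dict[str, Point],
) -> List[Point]:
    """Filter candidate positions down to those satisfying all constraints."""
    return [p for p in candidates if satisfies_all(p, constraints, objects)]


# Toy scene: "place it left of the mug and in front of the bowl".
objects = {"mug": (0.5, 0.5), "bowl": (0.2, 0.8)}
constraints = [("left of", "mug"), ("in front of", "bowl")]

# Candidate positions on a coarse 11 x 11 grid over the unit square.
candidates = [(x / 10, y / 10) for x in range(11) for y in range(11)]
result = valid_placements(candidates, constraints, objects)
```

A placement counts as successful only when every constraint holds at once, which is why the check uses `all(...)` rather than scoring constraints independently.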


Grounding Object Relations in Language-Conditioned Robotic Manipulation with Semantic-Spatial Reasoning

Object-Centric Instruction Augmentation for Robotic Manipulation

Learning to Place New Objects

Learning to Place New Objects in a Scene

Language Conditioned Spatial Relation Reasoning for 3D Object Grounding

Latent Space Planning for Multiobject Manipulation With Environment-Aware Relational Classifiers

Stimulating Imagination: Towards General-purpose Object Rearrangement

Language-Conditioned Imitation Learning with Base Skill Priors under Unstructured Data

Pave the Way to Grasp Anything: Transferring Foundation Models for Universal Pick-Place Robots

Latent Space Planning for Multi-Object Manipulation with Environment-Aware Relational Classifiers

Cross-Modal Match for Language Conditioned 3D Object Grounding

Instruct2Act: Mapping Multi-modality Instructions to Robotic Actions with Large Language Model

Beyond Object Recognition: A New Benchmark towards Object Concept Learning

An End-to-End Approach to Natural Language Object Retrieval Via Context-Aware Deep Reinforcement Learning.

Object-Centric Scene Representations Using Active Inference

Bridging Language and Action: A Survey of Language-Conditioned Robot Manipulation

Improving Vision-and-Language Reasoning via Spatial Relations Modeling

Language-Conditioned Region Proposal and Retrieval Network for Referring Expression Comprehension

Language-guided Active Sensing of Confined, Cluttered Environments via Object Rearrangement Planning

Learning 6-DoF Object Poses to Grasp Category-level Objects by Language Instructions