Object-centric Inference for Language Conditioned Placement: A Foundation Model based Approach

Zhixuan Xu, Kechun Xu, Yue Wang, Rong Xiong
2023-04-06
Abstract: We focus on the task of language-conditioned object placement, in which a robot should generate placements that satisfy all the spatial relational constraints in language instructions. Previous works based on rule-based language parsing or scene-centric visual representation have restrictions on the form of instructions and reference objects or require large amounts of training data. We propose an object-centric framework that leverages foundation models to ground the reference objects and spatial relations for placement, which is more sample efficient and generalizable. Experiments indicate that our model can achieve a 97.75% success rate of placement with only ~0.26M trainable parameters. Besides, our method generalizes better to both unseen objects and instructions. Moreover, with only 25% training data, we still outperform the top competing approach.
Robotics, Artificial Intelligence, Machine Learning
What problem does this paper attempt to address?
### Problems Addressed by the Paper

The paper focuses on language-conditioned object placement: a robot places an object at a location specified by a user's natural-language instruction, and the placement must satisfy every spatial relation constraint in that instruction. The summary covers two points:

1. **Limitations of Existing Methods**:
   - Methods based on rule-based language parsing or scene-centric visual representations either restrict the form of instructions and reference objects or require large amounts of training data.
   - Previous frameworks struggle to adapt to new objects and instructions in an open-world setting.
2. **Proposed New Method**:
   - An object-centric framework uses foundation models to ground the reference objects and spatial relations, which improves sample efficiency and generalization.
   - Pre-trained large language models (such as GPT-3) and vision-language models (such as CLIP) parse and encode the instructions, which allows more flexible instruction forms.

With these changes, the method reaches a 97.75% placement success rate with only ~0.26M trainable parameters and generalizes better to unseen objects and instructions. With only 25% of the training data, it still outperforms the top competing approach.
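To make the task definition concrete, here is a minimal sketch of what "satisfying all spatial relation constraints" could mean for a candidate placement. It is an illustration only, not the paper's method: the paper grounds relations with learned components built on foundation models, whereas the geometric predicates below are hand-written stand-ins. The names `Constraint`, `RELATIONS`, `satisfies_all`, and `success_rate`, the 2D coordinate convention, and the example scene are all assumptions made for this sketch.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# (x, y) in a top-down workspace frame. Assumed convention for this sketch:
# x grows to the right, y grows away from the robot.
Point = Tuple[float, float]


@dataclass(frozen=True)
class Constraint:
    relation: str   # e.g. "left of"
    reference: str  # name of the reference object in the scene


# Hand-written stand-ins for relation grounding (hypothetical, not learned).
RELATIONS: Dict[str, Callable[[Point, Point], bool]] = {
    "left of": lambda p, ref: p[0] < ref[0],
    "right of": lambda p, ref: p[0] > ref[0],
    "behind": lambda p, ref: p[1] > ref[1],
    "in front of": lambda p, ref: p[1] < ref[1],
}


def satisfies_all(placement: Point,
                  constraints: List[Constraint],
                  scene: Dict[str, Point]) -> bool:
    """A placement counts as a success only if every constraint holds."""
    return all(
        RELATIONS[c.relation](placement, scene[c.reference])
        for c in constraints
    )


def success_rate(placements: List[Point],
                 constraints: List[Constraint],
                 scene: Dict[str, Point]) -> float:
    """Fraction of candidate placements that satisfy all constraints."""
    if not placements:
        return 0.0
    hits = sum(satisfies_all(p, constraints, scene) for p in placements)
    return hits / len(placements)


# Made-up scene for an instruction such as
# "put it left of the mug and in front of the plate".
scene = {"mug": (0.5, 0.5), "plate": (0.2, 0.8)}
constraints = [Constraint("left of", "mug"), Constraint("in front of", "plate")]

print(satisfies_all((0.3, 0.4), constraints, scene))  # True
print(satisfies_all((0.7, 0.4), constraints, scene))  # False: not left of the mug
print(success_rate([(0.3, 0.4), (0.7, 0.4)], constraints, scene))  # 0.5
```

The conjunction in `satisfies_all` is the point of the sketch: an instruction with several relations is only fulfilled when each one holds against its own reference object, which is why grounding each reference object separately is a natural fit for this task.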