Programmatically Grounded, Compositionally Generalizable Robotic Manipulation

Renhao Wang, Jiayuan Mao, Joy Hsu, Hang Zhao, Jiajun Wu, Yang Gao
2023-04-27
Abstract: Robots operating in the real world require both rich manipulation skills and the ability to semantically reason about when to apply those skills. Towards this goal, recent works have integrated semantic representations from large-scale pretrained vision-language (VL) models into manipulation models, imparting them with more general reasoning capabilities. However, we show that the conventional pretraining-finetuning pipeline for integrating such representations entangles the learning of domain-specific action information and domain-general visual information, leading to less data-efficient training and poor generalization to unseen objects and tasks. To this end, we propose ProgramPort, a modular approach to better leverage pretrained VL models by exploiting the syntactic and semantic structures of language instructions. Our framework uses a semantic parser to recover an executable program, composed of functional modules grounded on vision and action across different modalities. Each functional module is realized as a combination of deterministic computation and learnable neural networks. Program execution produces parameters to general manipulation primitives for a robotic end-effector. The entire modular network can be trained with end-to-end imitation learning objectives. Experiments show that our model successfully disentangles action and perception, translating to improved zero-shot and compositional generalization in a variety of manipulation behaviors. Project webpage: https://progport.github.io
Artificial Intelligence, Computer Vision and Pattern Recognition, Robotics
What problem does this paper attempt to address?
The paper addresses how to give robots both rich manipulation skills and the ability to reason semantically about when to apply them. Recent work integrates semantic representations from large-scale pretrained vision-language (VL) models into manipulation models to give them more general reasoning capabilities. The paper argues that the conventional pretraining-finetuning pipeline used for this integration entangles domain-specific action information with domain-general visual information. The result is less data-efficient training and poor generalization to unseen objects and tasks.

To address this, the authors propose ProgramPort, a modular approach that makes better use of pretrained VL models (such as CLIP) by exploiting the syntactic and semantic structure of natural language instructions. A semantic parser recovers an executable program from the instruction. The program is composed of functional modules grounded in different modalities (vision and action), and each module combines deterministic computation with learnable neural networks. Executing the program produces the parameters of general manipulation primitives for the robot's end-effector. The entire modular network can be trained with end-to-end imitation learning objectives.

Experiments show that the model disentangles action from perception, which translates into improved zero-shot and compositional generalization across a variety of manipulation behaviors. In summary, the main contribution is a modular framework for using pretrained VL models that, according to the paper, improves data efficiency and generalization to unseen objects and tasks.
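
The pipeline described above (instruction, then parsed program, then grounded modules, then primitive parameters) can be illustrated with a toy sketch. This is not the authors' implementation: the names `Call`, `toy_parse`, `filter_module`, `pick_module`, and `execute` are hypothetical, the "parser" handles only instructions of the form "pick the <concept>", and a lookup table stands in for the pretrained VL model that would ground a concept in the image. The sketch only shows how visual grounding and action prediction can live in separate modules composed by a program.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Grid = List[List[float]]  # toy H x W score map standing in for an image-sized tensor


@dataclass(frozen=True)
class Call:
    """One node of a toy program: module name, language argument, child calls."""
    module: str
    arg: str = ""
    children: Tuple["Call", ...] = ()


def toy_parse(instruction: str) -> Call:
    """Hypothetical stand-in for a semantic parser; supports 'pick the <concept>' only."""
    words = instruction.lower().split()
    if len(words) < 3 or words[:2] != ["pick", "the"]:
        raise ValueError("toy parser only supports 'pick the <concept>'")
    return Call("pick", children=(Call("filter", arg=" ".join(words[2:])),))


def filter_module(scene: Dict[str, Grid], concept: str) -> Grid:
    """Visual grounding: return a mask for `concept`.

    Assumption: a lookup table replaces the VL model that would score pixels.
    """
    return scene[concept]


def pick_module(mask: Grid, affordance: Grid) -> Tuple[int, int]:
    """Action grounding: weight affordance scores by the mask, return the best cell."""
    scored = [
        (mask[r][c] * affordance[r][c], (r, c))
        for r in range(len(mask))
        for c in range(len(mask[0]))
    ]
    return max(scored, key=lambda item: item[0])[1]


def execute(call: Call, scene: Dict[str, Grid], affordance: Grid):
    """Recursively run the program; the root returns primitive parameters (row, col)."""
    if call.module == "filter":
        return filter_module(scene, call.arg)
    if call.module == "pick":
        mask = execute(call.children[0], scene, affordance)
        return pick_module(mask, affordance)
    raise ValueError(f"unknown module: {call.module}")


# Toy 2 x 2 scene: one mask per concept, plus concept-agnostic affordance scores.
scene = {
    "red block": [[0.0, 1.0], [0.0, 0.0]],
    "blue bowl": [[0.0, 0.0], [1.0, 0.0]],
}
affordance = [[0.2, 0.9], [0.8, 0.1]]

print(execute(toy_parse("pick the red block"), scene, affordance))  # (0, 1)
print(execute(toy_parse("pick the blue bowl"), scene, affordance))  # (1, 0)
```

In this sketch the affordance map is shared across both instructions and only the mask changes, which mirrors, in a very simplified way, the idea of keeping action knowledge separate from concept grounding.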