Scaling Up and Distilling Down: Language-Guided Robot Skill Acquisition

Huy Ha, Pete Florence, Shuran Song
2023-10-01
Abstract: We present a framework for robot skill acquisition, which 1) efficiently scales up the generation of language-labelled robot data and 2) effectively distills this data down into a robust multi-task language-conditioned visuo-motor policy. For (1), we use a large language model (LLM) to guide high-level planning, and sampling-based robot planners (e.g. motion or grasp samplers) to generate diverse and rich manipulation trajectories. To robustify this data-collection process, the LLM also infers a code snippet for the success condition of each task, which lets the data-collection process detect failure and retry, and also enables automatic labeling of trajectories with success/failure. For (2), we extend the diffusion policy single-task behavior-cloning approach to multi-task settings with language conditioning. Finally, we propose a new multi-task benchmark with 18 tasks across five domains to test long-horizon behavior, common-sense reasoning, tool use, and intuitive physics. We find that our distilled policy successfully learned the robust retrying behavior in its data-collection procedure, while improving absolute success rates by 33.2% on average across five domains. Code, data, and additional qualitative results are available at https://www.cs.columbia.edu/~huy/scalingup/.
Robotics
What problem does this paper attempt to address?
This paper proposes a framework for robot skill acquisition. It addresses two problems: how to scale up the collection of language-labelled robot data efficiently, and how to learn a multi-task, language-conditioned visuomotor policy from that data. The framework has two key parts:
1. Scaling up data generation: A large language model (LLM) handles high-level planning, and sampling-based robot planners (such as motion or grasp samplers) generate diverse manipulation trajectories. The LLM also infers a code snippet for the success condition of each task, so the data-collection process can detect failures and retry, and can automatically label each trajectory as a success or a failure.
2. Distilling down into a policy: The single-task diffusion policy approach is extended to a multi-task setting, and language-conditioned training yields a closed-loop, language-conditioned visuomotor policy. The distilled policy picks up the robust retrying behavior present in the data-collection process and improves absolute success rates by 33.2% on average across five domains.
Additionally, the paper proposes a new multi-task benchmark of 18 tasks across five domains, which tests long-horizon behavior, common-sense reasoning, tool use, and intuitive physics. The core idea is to use the LLM's common-sense reasoning for efficient exploration while learning reusable 6-DoF skills for real-world use. The paper reports experiments in which the framework outperforms the compared methods in both data-generation efficiency and policy learning, and in which the learned policy transfers directly to the real world without fine-tuning.
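The data-generation step described above (sample an action, check a success condition, retry on failure, then label the trajectory) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the state dictionary, the `drawer_is_open` condition, the 0.15 threshold, and the `sample_and_execute` toy planner are all hypothetical stand-ins, chosen only to show how a success function can drive both retrying and automatic labeling.

```python
import random
from dataclasses import dataclass, field
from typing import Callable, Dict, List

State = Dict[str, float]  # hypothetical simulator state: joint name -> value


@dataclass
class Trajectory:
    task: str                                   # language label for this trajectory
    states: List[State] = field(default_factory=list)
    success: bool = False                       # set by the success check


def drawer_is_open(state: State) -> bool:
    """Stand-in for an LLM-inferred success condition (hypothetical threshold)."""
    return state["drawer_joint"] > 0.15


def sample_and_execute(state: State, rng: random.Random) -> State:
    """Toy stand-in for a sampling-based planner: a random pull that may fall short."""
    pull = rng.uniform(0.0, 0.3)
    return {"drawer_joint": max(state["drawer_joint"], pull)}


def collect(
    task: str,
    success_fn: Callable[[State], bool],
    rng: random.Random,
    max_attempts: int = 5,
) -> Trajectory:
    """Attempt the task, retrying on failure, and label the resulting trajectory."""
    state: State = {"drawer_joint": 0.0}
    traj = Trajectory(task=task, states=[state])
    for _ in range(max_attempts):
        state = sample_and_execute(state, rng)
        traj.states.append(state)   # failed attempts stay in the trajectory
        if success_fn(state):       # success check ends the episode early
            traj.success = True
            break
    return traj


if __name__ == "__main__":
    traj = collect("open the drawer", drawer_is_open, random.Random(0))
    print(traj.task, traj.success, len(traj.states) - 1, "attempt(s)")
```

Keeping failed attempts inside the same trajectory is a deliberate choice in this sketch: a policy trained on such data sees examples of recovering after a failed attempt, which matches the retrying behavior the summary attributes to the distilled policy.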