Keypoint Abstraction using Large Models for Object-Relative Imitation Learning

Xiaolin Fang, Bo-Ruei Huang, Jiayuan Mao, Jasmine Shone, Joshua B. Tenenbaum, Tomás Lozano-Pérez, Leslie Pack Kaelbling
2024-10-31
Abstract: Generalization to novel object configurations and instances across diverse tasks and environments is a critical challenge in robotics. Keypoint-based representations have been proven effective as a succinct representation for capturing essential object features, and for establishing a reference frame in action prediction, enabling data-efficient learning of robot skills. However, their manual design nature and reliance on additional human labels limit their scalability. In this paper, we propose KALM, a framework that leverages large pre-trained vision-language models (LMs) to automatically generate task-relevant and cross-instance consistent keypoints. KALM distills robust and consistent keypoints across views and objects by generating proposals using LMs and verifying them against a small set of robot demonstration data. Based on the generated keypoints, we can train keypoint-conditioned policy models that predict actions in keypoint-centric frames, enabling robots to generalize effectively across varying object poses, camera views, and object instances with similar functional shapes. Our method demonstrates strong performance in the real world, adapting to different tasks and environments from only a handful of demonstrations while requiring no additional labels. Website: https://kalm-il.github.io/
Robotics, Artificial Intelligence, Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
### The Problem the Paper Attempts to Solve

This paper addresses a key challenge in robotics: generalizing to new object configurations and instances across different tasks and environments. Specifically, the authors focus on data-efficient robot skill learning through keypoint representations. Keypoints are a concise representation that captures the essential features of objects and establishes a reference frame for action prediction. However, keypoints are typically designed by hand and depend on additional human labels, which limits the scalability of this approach.

To this end, the paper proposes a framework named KALM (Keypoint Abstraction using Large Models for Object-Relative Imitation Learning), which leverages large pre-trained vision-language models (LMs) to automatically generate task-relevant and cross-instance consistent keypoints. KALM extracts robust and consistent keypoints by generating proposals with the LMs and verifying them against a small set of robot demonstrations. It then trains a keypoint-conditioned policy model that predicts actions in keypoint-centric frames. This allows robots to generalize across different object poses, camera viewpoints, and object instances with similar functional shapes.

### Main Contributions

1. **Keypoint Extraction**: A method combining proposal and validation steps extracts task-relevant and cross-instance consistent keypoints using large pre-trained models.
2. **Action Representation**: Based on the extracted keypoints and their features, a keypoint-centric, object-relative action representation is constructed, which a diffusion policy model can learn from a few demonstrations.
3. **Generalization Capability**: Experiments show that the framework generalizes well in the real world, adapting to different tasks and environments without requiring additional labels.

### Method Overview

1. **Problem Definition**: For each skill, the inputs are a video of a successful task execution, a few demonstration trajectories (5 to 10), and a natural language task description.
2. **Keypoint Extraction**: Candidate regions are generated by large pre-trained models and refined by an image segmentation model into candidate keypoints; the final set of keypoints is selected through a validation process.
3. **Policy Learning**: Based on the extracted keypoints and their features, a conditional diffusion model is trained to generate robot trajectories relative to the object keypoints.
4. **Inference Process**: In new scenes, the extracted keypoints are detected, robot actions relative to these keypoints are predicted, and the actions are then transformed back to the world coordinate system for execution.

### Experimental Results

1. **Simulation Experiments**: Data efficiency was evaluated in the Meta-World simulator on 5 tasks (drawer opening, drawer closing, button side press, button top press, lever pull). Results show that KALM outperforms baseline methods when only a small amount of demonstration data is available.
2. **Real-World Experiments**: Three tasks (coffee machine handle, drawer opening, pouring into a bowl) were run on a Franka robot arm to test KALM's generalization across camera viewpoints and object instances. The results indicate that KALM performs well across the tested conditions.
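
To make the inference step in the method overview more concrete, the following is a minimal sketch of expressing a trajectory relative to object keypoints and mapping it back to the world frame in a new scene. It is not the authors' implementation: the frame construction from three keypoints, the function names (`frame_from_keypoints`, `to_keypoint_frame`, `to_world_frame`), and the replay of a demonstrated trajectory in place of a learned diffusion policy are all assumptions made for illustration. Only 3D positions are handled; end-effector orientation and gripper state are omitted.

```python
import numpy as np


def frame_from_keypoints(keypoints):
    """Build a 4x4 world-from-keypoint transform from three 3D keypoints.

    Hypothetical construction (not from the paper): the first keypoint is the
    origin, the x-axis points to the second keypoint, and the z-axis is normal
    to the plane spanned by the three points. The points must not be collinear.
    """
    kps = np.asarray(keypoints, dtype=float)
    origin = kps[0]
    x_axis = kps[1] - origin
    x_axis = x_axis / np.linalg.norm(x_axis)
    z_axis = np.cross(x_axis, kps[2] - origin)
    z_axis = z_axis / np.linalg.norm(z_axis)
    y_axis = np.cross(z_axis, x_axis)  # completes a right-handed frame

    transform = np.eye(4)
    transform[:3, 0] = x_axis
    transform[:3, 1] = y_axis
    transform[:3, 2] = z_axis
    transform[:3, 3] = origin
    return transform


def to_keypoint_frame(world_from_kp, points_world):
    """Express world-frame points (N x 3) in the keypoint-centric frame."""
    rotation, translation = world_from_kp[:3, :3], world_from_kp[:3, 3]
    points = np.asarray(points_world, dtype=float)
    # Row-vector form of R^T (p - t).
    return (points - translation) @ rotation


def to_world_frame(world_from_kp, points_local):
    """Map keypoint-frame points (N x 3) back to the world frame."""
    rotation, translation = world_from_kp[:3, :3], world_from_kp[:3, 3]
    points = np.asarray(points_local, dtype=float)
    # Row-vector form of R p + t.
    return points @ rotation.T + translation


# Demonstration scene: three object keypoints and a short end-effector path.
demo_keypoints = np.array([[0.5, 0.0, 0.2], [0.6, 0.0, 0.2], [0.5, 0.1, 0.2]])
demo_trajectory = np.array([[0.5, 0.0, 0.4], [0.5, 0.0, 0.3], [0.55, 0.0, 0.25]])

# Store the trajectory relative to the object keypoints.
relative_trajectory = to_keypoint_frame(
    frame_from_keypoints(demo_keypoints), demo_trajectory
)

# New scene: the same object rotated 90 degrees about z and shifted.
rot_z = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
shift = np.array([0.1, -0.2, 0.05])
new_keypoints = demo_keypoints @ rot_z.T + shift

# Recover a world-frame trajectory for the new object pose.
new_trajectory = to_world_frame(
    frame_from_keypoints(new_keypoints), relative_trajectory
)
```

Because the trajectory is stored relative to the keypoints, it moves rigidly with the object: in this sketch `new_trajectory` equals the demonstrated path under the same rotation and shift that were applied to the object. In the framework described above, a learned policy would predict the keypoint-relative trajectory rather than replaying a stored one.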

### Conclusion

By combining large pre-trained models with a validation process, KALM extracts task-relevant and cross-instance consistent keypoints, enabling data-efficient robot skill learning and demonstrating strong generalization in the real world.