Abstract: Generalization to novel object configurations and instances across diverse tasks and environments is a critical challenge in robotics. Keypoint-based representations have proven effective as a succinct way to capture essential object features and to establish a reference frame for action prediction, enabling data-efficient learning of robot skills. However, their reliance on manual design and additional human labels limits their scalability. In this paper, we propose KALM, a framework that leverages large pre-trained vision-language models (LMs) to automatically generate task-relevant and cross-instance consistent keypoints. KALM distills robust and consistent keypoints across views and objects by generating proposals using LMs and verifying them against a small set of robot demonstration data. Based on the generated keypoints, we can train keypoint-conditioned policy models that predict actions in keypoint-centric frames, enabling robots to generalize effectively across varying object poses, camera views, and object instances with similar functional shapes. Our method demonstrates strong performance in the real world, adapting to different tasks and environments from only a handful of demonstrations while requiring no additional labels. Website: https://kalm-il.github.io/

What problem does this paper attempt to address?

### The Problem the Paper Attempts to Solve

This paper addresses a key challenge in robotics: generalizing to new object configurations and instances across different tasks and environments. Specifically, the authors focus on data-efficient robot skill learning through keypoint representations. Keypoints are a concise representation that captures the essential features of objects and establishes a reference frame for action prediction, which makes skill learning data-efficient. However, keypoints are usually designed by hand and depend on additional human labels, which limits the scalability of this approach.

To this end, the paper proposes a framework named KALM (Keypoint Abstraction using Large Models for Object-Relative Imitation Learning), which leverages large pre-trained vision-language models (LMs) to automatically generate task-relevant and cross-instance consistent keypoints. KALM extracts robust and consistent keypoints by generating proposals and validating them against demonstration data, and then trains a keypoint-conditioned policy model to predict keypoint-centric actions. This allows robots to generalize across different object poses, camera viewpoints, and object instances with similar functional shapes.

### Main Contributions

1. **Keypoint Extraction**: A method combining proposal and validation steps is proposed to extract task-relevant and cross-instance consistent keypoints from large pre-trained models.
2. **Action Representation**: Based on the extracted keypoints and their features, a keypoint-centric, object-relative action representation is constructed, which can be learned from a few demonstrations with a diffusion policy model.
3. **Generalization Capability**: Experiments demonstrate that the proposed framework generalizes well in the real world, adapting to different tasks and environments without requiring additional labels.

### Method Overview

1. **Problem Definition**: Each skill requires a video of a successful task execution, a few demonstration trajectories (5 to 10), and a natural language task description.
2. **Keypoint Extraction**: Candidate regions are generated by large pre-trained models and refined by an image segmentation model into candidate keypoints; the final set of keypoints is selected through a validation process.
3. **Policy Learning**: Based on the extracted keypoints and their features, a conditional diffusion model is trained to generate robot trajectories relative to the object keypoints.
4. **Inference Process**: In a new scene, the extracted keypoints are detected, robot actions relative to these keypoints are predicted, and the actions are then transformed back to the world coordinate system for execution.

### Experimental Results

1. **Simulation Experiments**: Experiments were conducted in the Meta-World simulator to evaluate data efficiency on five tasks (drawer opening, drawer closing, button side press, button top press, lever pull). Results show that KALM outperforms baseline methods when only a small amount of demonstration data is available.
2. **Real-World Experiments**: Experiments were conducted on a Franka robot arm for three tasks (coffee machine handle, drawer opening, pouring into a bowl) to verify KALM's generalization across different camera viewpoints and object instances. The results indicate that KALM performs well across these variations.

### Conclusion

By combining large pre-trained models with a validation process, KALM extracts task-relevant and cross-instance consistent keypoints, achieving data-efficient robot skill learning and demonstrating strong generalization in the real world.
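
The inference step in the Method Overview (predict actions relative to the keypoints, then transform them back to the world coordinate system) can be illustrated with a small sketch. The summary does not say how the keypoint-centric frame is constructed, so everything here is an assumption for illustration: the frame is built from three non-collinear 3D keypoints, and the helper names `frame_from_keypoints`, `to_local`, and `to_world` are hypothetical, not taken from the paper or its code.

```python
import math

def _sub(a, b):
    return [a[i] - b[i] for i in range(3)]

def _cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]

def _unit(a):
    n = math.sqrt(sum(c * c for c in a))
    if n < 1e-9:
        raise ValueError("keypoints are coincident or collinear")
    return [c / n for c in a]

def frame_from_keypoints(kps):
    """Assumed construction: origin at keypoint 0, x-axis toward keypoint 1,
    z-axis normal to the plane spanned by the three keypoints."""
    origin = list(kps[0])
    x = _unit(_sub(kps[1], origin))
    z = _unit(_cross(x, _sub(kps[2], origin)))
    y = _cross(z, x)  # unit length because z and x are orthonormal
    return origin, (x, y, z)

def to_local(frame, p_world):
    """Express a world-frame point in the keypoint frame (used on demonstrations)."""
    origin, axes = frame
    d = _sub(p_world, origin)
    return [sum(d[i] * a[i] for i in range(3)) for a in axes]

def to_world(frame, p_local):
    """Map a keypoint-frame point back to the world frame (used at inference)."""
    origin, (x, y, z) = frame
    return [origin[i] + p_local[0] * x[i] + p_local[1] * y[i] + p_local[2] * z[i]
            for i in range(3)]

# Demonstration scene: keypoints on the object and one end-effector waypoint.
demo_frame = frame_from_keypoints([[1, 0, 0], [2, 0, 0], [1, 1, 0]])
waypoint_local = to_local(demo_frame, [1.0, 0.0, 0.1])

# New scene: the same object rotated 90 degrees about z and shifted.
new_frame = frame_from_keypoints([[0.5, 1.2, 0], [0.5, 2.2, 0], [-0.5, 1.2, 0]])
waypoint_world = to_world(new_frame, waypoint_local)
```

Because the waypoint is stored relative to the keypoints, it moves with the object when the object's pose changes, which is the intuition behind the object-relative action representation described above. A learned policy would predict whole trajectories rather than replay one stored waypoint.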

Keypoint Abstraction using Large Models for Object-Relative Imitation Learning

KALIE: Fine-Tuning Vision-Language Models for Open-World Manipulation without Robot Data

K-VIL: Keypoints-based Visual Imitation Learning

Self-Supervised Learning of Multi-Object Keypoints for Robotic Manipulation

Kinematic-aware Prompting for Generalizable Articulated Object Manipulation with LLMs

Affordance-Guided Reinforcement Learning via Visual Prompting

RoboLLM: Robotic Vision Tasks Grounded on Multimodal Large Language Models

Long-horizon Locomotion and Manipulation on a Quadrupedal Robot with Large Language Models

ManipLLM: Embodied Multimodal Large Language Model for Object-Centric Robotic Manipulation

MOKA: Open-World Robotic Manipulation through Mark-Based Visual Prompting

Bi-KVIL: Keypoints-based Visual Imitation Learning of Bimanual Manipulation Tasks

Kalib: Markerless Hand-Eye Calibration with Keypoint Tracking

Scaling Manipulation Learning with Visual Kinematic Chain Prediction

Task and Motion Planning with Large Language Models for Object Rearrangement

Robot Skill Generalization via Keypoint Integrated Soft Actor-Critic Gaussian Mixture Models

SKT: Integrating State-Aware Keypoint Trajectories with Vision-Language Models for Robotic Garment Manipulation

Instruct2Act: Mapping Multi-modality Instructions to Robotic Actions with Large Language Model

What's the Move? Hybrid Imitation Learning via Salient Points

Preference-Conditioned Language-Guided Abstraction

Leveraging Commonsense Knowledge from Large Language Models for Task and Motion Planning

Learning Generalizable Dexterous Manipulation from Human Grasp Affordance