Abstract: The ability to leverage heterogeneous robotic experience from different robots and tasks to quickly master novel skills and embodiments has the potential to transform robot learning. Inspired by recent advances in foundation models for vision and language, we propose a multi-embodiment, multi-task generalist agent for robotic manipulation. This agent, named RoboCat, is a visual goal-conditioned decision transformer capable of consuming action-labelled visual experience. This data spans a large repertoire of motor control skills from simulated and real robotic arms with varying sets of observations and actions. With RoboCat, we demonstrate the ability to generalise to new tasks and robots, both zero-shot as well as through adaptation using only 100-1000 examples for the target task. We also show how a trained model itself can be used to generate data for subsequent training iterations, thus providing a basic building block for an autonomous improvement loop. We investigate the agent's capabilities, with large-scale evaluations both in simulation and on three different real robot embodiments. We find that as we grow and diversify its training data, RoboCat not only shows signs of cross-task transfer, but also becomes more efficient at adapting to new tasks.

What problem does this paper attempt to address?

### Problems the paper attempts to solve

The paper "RoboCat: A Self-Improving Generalist Agent for Robotic Manipulation" attempts to solve the following problems:

1. **Multi-task and multi-robot adaptation**:
   - Most real-world robot learning research focuses on one task at a time, because designing tasks and generating robot experience is very costly. Using large-scale heterogeneous robot data to quickly master new skills and adapt to new robots remains an open challenge in robotics.
   - The paper proposes a multi-task, multi-embodiment generalist agent (RoboCat) that aims to quickly master new skills and adapt to new robots by leveraging heterogeneous experience from different robots and tasks.
2. **Visual goal-conditioned decision-making**:
   - RoboCat is a visual goal-conditioned decision transformer that can consume action-labelled visual experience. This data covers a range of motor control skills from simulated and real robotic arms, with varying sets of observations and actions.
   - Through visual goal conditioning, RoboCat can generalise to new tasks and robots either zero-shot or through few-shot adaptation (100-1000 examples).
3. **Self-improvement ability**:
   - The trained model itself can be used to generate data for subsequent training iterations, providing a basic building block for an autonomous improvement loop.
   - Through this self-improvement process, RoboCat not only transfers across tasks, but also adapts to new tasks more efficiently and performs better on existing tasks.
4. **Large-scale evaluation**:
   - The authors evaluate RoboCat's capabilities at large scale, both in simulation and on three different real robot embodiments.
   - The results show that as the training data grows and diversifies, RoboCat not only shows signs of cross-task transfer, but also adapts to new tasks more efficiently.

### Main contributions

1. **For the first time, show that large-scale Transformer sequence models can solve a large number of dexterous tasks on multiple real robot embodiments**.
2. **Study RoboCat's ability to adapt to unseen tasks from a small amount of expert demonstration data, lowering the barrier to learning new skills**.
3. **Demonstrate a simple and effective self-improvement process for reintegrating these skills into the generalist agent**.
4. **Show that, as the training data is expanded and enriched, RoboCat performs better on training tasks and is more efficient when fine-tuning on new tasks**.

### Method overview

- **Training phase**:
  - Use a VQ-GAN encoder to pre-process images, then train RoboCat on large-scale data covering diverse tasks and robots.
  - Tasks are specified by visual goals, and each task is defined by the set of its valid start and end states.
- **Fine-tuning and self-improvement**:
  - Collect 100-1000 expert demonstrations for each new task and fine-tune RoboCat on them.
  - Deploy the fine-tuned policy to autonomously collect more data, which is used to train a new version of RoboCat.
- **Actual deployment**:
  - Deploy the fine-tuned policy on real robots to collect large-scale data for new tasks.
  - Address success detection and task reset in autonomous data collection, using a reward model and a policy pool to achieve automatic resets.

### Experimental setup

- **Robot embodiments**:
  - Include simulated and real-world Sawyer and Panda robotic arms, as well as the KUKA 14-DoF robotic arm.
  - Each robotic arm is equipped with a different gripper, and the KUKA robotic arm uses a custom-made three-finger gripper.
- **Tasks and object sets**:
  - Include various tasks such as structure building, insertion, and lifting, and use multiple real-object sets such as RGB objects, NIST-i gears, YCB fruits and vegetables, etc.
- **Data sources**:
  - Include expert data (data generated by RL-trained agents)
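
The fine-tuning and self-improvement steps in the method overview (fine-tune on a few hundred demonstrations, deploy the fine-tuned policy to collect more episodes, fold everything back into the training set) can be sketched as a toy loop. This is only an illustrative sketch, not the authors' code: `Episode`, `fine_tune`, `collect`, and `self_improvement_round` are hypothetical names, the policy is a random stub, the success flag stands in for a success detector such as a reward model, and the success rate is an invented placeholder rather than a result from the paper.

```python
import random
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Episode:
    task: str
    success: bool
    source: str  # "demo" (expert) or "self" (collected by a fine-tuned policy)


# Hypothetical stand-in for a policy: one rollout returns a success flag.
Policy = Callable[[random.Random], bool]


def fine_tune(task: str, demos: List[Episode]) -> Policy:
    """Stub for fine-tuning the generalist on a small demonstration set.

    The success rate is a made-up function of the demo count so that the
    loop has something to run; it is not taken from the paper.
    """
    success_rate = min(0.9, 0.3 + len(demos) / 2000)
    return lambda rng: rng.random() < success_rate


def collect(task: str, policy: Policy, n: int, rng: random.Random) -> List[Episode]:
    """Deploy the fine-tuned policy and log each rollout with its success flag."""
    return [Episode(task, policy(rng), "self") for _ in range(n)]


def self_improvement_round(
    dataset: List[Episode],
    new_task: str,
    demos: List[Episode],
    n_rollouts: int,
    rng: random.Random,
) -> List[Episode]:
    """One round: fine-tune, self-generate data, and return the grown dataset.

    The returned list is what the next generalist version would be trained on.
    """
    policy = fine_tune(new_task, demos)
    rollouts = collect(new_task, policy, n_rollouts, rng)
    return dataset + demos + rollouts


if __name__ == "__main__":
    rng = random.Random(0)
    base = [Episode("stack", True, "demo") for _ in range(50)]
    demos = [Episode("insert_gear", True, "demo") for _ in range(500)]
    grown = self_improvement_round(base, "insert_gear", demos, n_rollouts=200, rng=rng)
    print(len(grown))  # 750 = 50 existing + 500 demos + 200 self-generated
```

The sketch keeps both successful and failed rollouts in the grown dataset; whether and how failures are filtered before retraining is a design choice this summary does not specify.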
