Generative Image as Action Models

Mohit Shridhar, Yat Long Lo, Stephen James
2024-10-09
Abstract: Image-generation diffusion models have been fine-tuned to unlock new capabilities such as image-editing and novel view synthesis. Can we similarly unlock image-generation models for visuomotor control? We present GENIMA, a behavior-cloning agent that fine-tunes Stable Diffusion to 'draw joint-actions' as targets on RGB images. These images are fed into a controller that maps the visual targets into a sequence of joint-positions. We study GENIMA on 25 RLBench and 9 real-world manipulation tasks. We find that, by lifting actions into image-space, internet pre-trained diffusion models can generate policies that outperform state-of-the-art visuomotor approaches, especially in robustness to scene perturbations and generalizing to novel objects. Our method is also competitive with 3D agents, despite lacking priors such as depth, keypoints, or motion-planners.
Robotics, Artificial Intelligence, Computation and Language, Computer Vision and Pattern Recognition, Machine Learning
What problem does this paper attempt to address?
The paper asks whether an image-generation model can be used to produce robot actions for visuomotor control. The authors propose GENIMA, a behavior-cloning agent that fine-tunes Stable Diffusion to "draw joint-actions" as targets on RGB images. These images are then fed into a controller, which maps the visual targets into a sequence of joint positions.

The main contributions of the paper are:

1. **A new problem formulation**: Treating joint-action generation as an image-generation problem lets an internet pre-trained diffusion model learn action patterns as visual patterns.
2. **A proof-of-concept system**: The system can draw and execute joint actions.
3. **Empirical results and insights**: Simulation and real-world experiments show that GENIMA handles scene perturbations and generalizes to new objects. Without priors such as depth, keypoints, or motion planners, its performance is close to that of 3D agents.

The work is motivated by the limitations of existing robot learning methods in dealing with complex actions and scene changes. For example, although 3D-based agents perform well on specific tasks, they rely on priors such as depth cameras, keypoints, task-specific scene boundaries, and motion planners. GENIMA avoids these priors by representing actions as targets in images, which improves the generalization and robustness of the model.

The paper evaluates GENIMA on 25 RLBench tasks and 9 real-world manipulation tasks. The results show that GENIMA outperforms state-of-the-art visuomotor approaches on multiple tasks, and that it performs well under scene perturbations (such as randomized object colors, distractors, and lighting changes) and when generalizing to new objects.
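To make the idea of "actions as targets on RGB images" more concrete, here is a minimal sketch of how a target image could be built from known joint positions, for example when preparing training data. It is not the paper's renderer. The pinhole camera model, the flat disc markers, the colors, and the function names `project_points` and `draw_targets` are all assumptions made for illustration. In GENIMA itself, the target images at test time are produced by the fine-tuned Stable Diffusion model, and a separate controller turns them into joint positions.

```python
import numpy as np

def project_points(points_cam, fx, fy, cx, cy):
    """Project Nx3 camera-frame points (z > 0) to Nx2 pixel coordinates (u, v).

    Assumes an ideal pinhole camera with no lens distortion (hypothetical setup).
    """
    points_cam = np.asarray(points_cam, dtype=float)
    z = points_cam[:, 2]
    u = fx * points_cam[:, 0] / z + cx
    v = fy * points_cam[:, 1] / z + cy
    return np.stack([u, v], axis=1)

def draw_targets(image, pixels, colors, radius=4):
    """Return a copy of an HxWx3 uint8 image with one filled disc per target.

    Each joint gets its own color so that a downstream controller could tell
    the targets apart (an illustrative choice, not the paper's exact scheme).
    """
    out = image.copy()
    h, w = out.shape[:2]
    rows, cols = np.mgrid[0:h, 0:w]
    for (u, v), color in zip(pixels, colors):
        mask = (cols - u) ** 2 + (rows - v) ** 2 <= radius ** 2
        out[mask] = color
    return out

# Toy example: a black 64x64 frame and two hypothetical joint targets
# given in the camera frame (meters).
image = np.zeros((64, 64, 3), dtype=np.uint8)
targets_cam = np.array([[0.0, 0.0, 1.0], [0.1, -0.1, 1.0]])
pixels = project_points(targets_cam, fx=100.0, fy=100.0, cx=32.0, cy=32.0)
colors = [(255, 0, 0), (0, 255, 0)]
target_image = draw_targets(image, pixels, colors)
```

In this toy setup, `target_image` is the kind of picture the diffusion model would be trained to generate from the plain RGB observation, so the action is expressed entirely in image space and needs no depth input at inference time.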