RT-Affordance: Affordances are Versatile Intermediate Representations for Robot Manipulation

Soroush Nasiriany, Sean Kirmani, Tianli Ding, Laura Smith, Yuke Zhu, Danny Driess, Dorsa Sadigh, Ted Xiao
2024-11-05
Abstract: We explore how intermediate policy representations can facilitate generalization by providing guidance on how to perform manipulation tasks. Existing representations such as language, goal images, and trajectory sketches have been shown to be helpful, but these representations either do not provide enough context or provide over-specified context that yields less robust policies. We propose conditioning policies on affordances, which capture the pose of the robot at key stages of the task. Affordances offer expressive yet lightweight abstractions, are easy for users to specify, and facilitate efficient learning by transferring knowledge from large internet datasets. Our method, RT-Affordance, is a hierarchical model that first proposes an affordance plan given the task language, and then conditions the policy on this affordance plan to perform manipulation. Our model can flexibly bridge heterogeneous sources of supervision including large web datasets and robot trajectories. We additionally train our model on cheap-to-collect in-domain affordance images, allowing us to learn new tasks without collecting any additional costly robot trajectories. We show on a diverse set of novel tasks how RT-Affordance exceeds the performance of existing methods by over 50%, and we empirically demonstrate that affordances are robust to novel settings. Videos available at https://snasiriany.me/rt-affordance
Robotics, Artificial Intelligence, Computation and Language, Computer Vision and Pattern Recognition, Machine Learning
### What problem does this paper attempt to address?

This paper addresses generalization in robotic manipulation, specifically how intermediate representations can improve a policy's ability to generalize. Existing intermediate representations such as language, goal images, and trajectory sketches help guide robots through tasks, but they either provide too little context or over-specify it, which makes learning harder and yields less robust policies. To address this, the authors propose conditioning policies on **affordances**. Affordances capture the pose of the robot at key stages of a task. They are expressive yet lightweight abstractions that are easy for users to specify, and they facilitate efficient learning by transferring knowledge from large internet datasets.

#### Specific problems include:

1. **Limitations of existing representations**: Language descriptions are usually too brief to provide sufficient guidance for manipulation; goal images describe the final configuration in detail but are high-dimensional and difficult to learn from; trajectory sketches offer some spatial guidance but still lack sufficient manipulation detail.
2. **High cost of data collection**: Traditional robot data collection, such as gathering demonstrations through teleoperation, is expensive and difficult to scale.
3. **Insufficient generalization**: Existing methods perform poorly on new objects, new scenes, and new tasks, especially in out-of-distribution (OOD) settings.

### Solutions

The authors propose **RT-Affordance**, a hierarchical model. It first predicts an affordance plan from the task language, and then conditions the policy on this plan to perform the manipulation task.
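The two-stage structure described above can be sketched in a few lines of Python. This is only an illustration of the control flow, not the authors' implementation: the names `AffordancePlan`, `run_hierarchical_step`, `dummy_affordance_model`, and `dummy_policy` are hypothetical, the encoding of a pose as a tuple of floats is an assumption, and the two stand-in functions return fixed values where the real system would use learned models.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

# Hypothetical types: the source only says that an affordance captures the
# robot's pose at key stages of the task. The concrete encoding is assumed.
Pose = Tuple[float, ...]
Observation = Sequence[float]
Action = Tuple[float, ...]


@dataclass(frozen=True)
class AffordancePlan:
    """Robot poses at key stages of the task, in execution order."""
    key_poses: List[Pose]


AffordanceModel = Callable[[str, Observation], AffordancePlan]
Policy = Callable[[Observation, AffordancePlan], Action]


def run_hierarchical_step(
    task: str,
    observation: Observation,
    affordance_model: AffordanceModel,
    policy: Policy,
) -> Tuple[AffordancePlan, Action]:
    """Stage 1 proposes a plan from the task language; stage 2 acts on it."""
    plan = affordance_model(task, observation)
    action = policy(observation, plan)
    return plan, action


def dummy_affordance_model(task: str, observation: Observation) -> AffordancePlan:
    # Stand-in for the learned affordance predictor: returns fixed key poses.
    return AffordancePlan(key_poses=[(0.1, 0.2, 0.3), (0.4, 0.5, 0.6)])


def dummy_policy(observation: Observation, plan: AffordancePlan) -> Action:
    # Stand-in for the learned policy: simply targets the first key pose.
    return plan.key_poses[0]


plan, action = run_hierarchical_step(
    "pick up the kettle",  # illustrative task string, not from the paper
    [0.0],
    dummy_affordance_model,
    dummy_policy,
)
```

Keeping the plan as an explicit value between the two stages is what would let a user supply or correct an affordance by hand instead of relying on the predicted one.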
RT-Affordance has the following advantages:

- **Expressive and lightweight**: Affordances provide precise, concise guidance for manipulation without being redundant or omitting key information.
- **Easy to specify**: Users can specify affordances through simple language instructions or visual markers.
- **Efficient learning**: By combining large internet datasets with a small number of labeled in-domain images, the model can learn new tasks at low cost, without additional expensive robot demonstrations.

In experiments, RT-Affordance significantly outperforms existing methods on multiple tasks, and it generalizes better to new objects, new perspectives, and new backgrounds.

### Summary

The core question of this paper is how intermediate representations, and affordances in particular, can improve generalization in robotic manipulation and thereby make task execution more efficient and more robust.