Abstract: We explore how intermediate policy representations can facilitate generalization by providing guidance on how to perform manipulation tasks. Existing representations such as language, goal images, and trajectory sketches have been shown to be helpful, but these representations either do not provide enough context or provide over-specified context that yields less robust policies. We propose conditioning policies on affordances, which capture the pose of the robot at key stages of the task. Affordances offer expressive yet lightweight abstractions, are easy for users to specify, and facilitate efficient learning by transferring knowledge from large internet datasets. Our method, RT-Affordance, is a hierarchical model that first proposes an affordance plan given the task language, and then conditions the policy on this affordance plan to perform manipulation. Our model can flexibly bridge heterogeneous sources of supervision including large web datasets and robot trajectories. We additionally train our model on cheap-to-collect in-domain affordance images, allowing us to learn new tasks without collecting any additional costly robot trajectories. We show on a diverse set of novel tasks how RT-Affordance exceeds the performance of existing methods by over 50%, and we empirically demonstrate that affordances are robust to novel settings. Videos available at https://snasiriany.me/rt-affordance

What problem does this paper attempt to address?

This paper addresses generalization in robotic manipulation, and specifically how intermediate policy representations can improve it. Existing representations such as language, goal images, and trajectory sketches help guide a robot through a task, but they either provide too little context or over-specify it, which makes learning harder and yields less robust policies. The authors instead propose conditioning policies on **affordances**, which capture the pose of the robot at key stages of the task. Affordances are expressive yet lightweight abstractions that are easy for users to specify, and they support efficient learning by transferring knowledge from large internet datasets.

#### Specific problems include:

1. **Limitations of existing representations**: Language descriptions are usually too brief to provide sufficient guidance for manipulation; goal images specify the final configuration in detail but are high-dimensional and hard to learn from; trajectory sketches give some spatial guidance but still lack manipulation detail.
2. **High cost of data collection**: Collecting robot demonstrations, for example through teleoperation, is expensive and difficult to scale.
3. **Insufficient generalization**: Existing methods perform poorly on new objects, new scenes, and new tasks, especially in out-of-distribution (OOD) settings.

### Solutions

The authors propose **RT-Affordance**, a hierarchical model. It first proposes an affordance plan given the task language, and then conditions the policy on that plan to perform manipulation. This method has the following advantages:

- **Expressive and lightweight**: Affordances provide precise, concise guidance for manipulation without over-specifying the task or omitting key information.
- **Easy to specify**: Users can specify affordances through simple language instructions or visual markers.
- **Efficient learning**: By combining large internet datasets with cheap-to-collect in-domain affordance images, the model can learn new tasks without collecting additional costly robot trajectories.

In the reported experiments, RT-Affordance exceeds the performance of existing methods by over 50% on a diverse set of novel tasks, and affordances remain robust when the robot faces new objects, new camera viewpoints, and new backgrounds.

### Summary

The core question of this paper is how intermediate representations, and affordances in particular, can improve generalization in robotic manipulation and thereby make task execution more efficient and more robust.
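The two-stage hierarchy described above (propose an affordance plan from the task language, then condition the policy on that plan) can be illustrated with a toy sketch. This is not the authors' implementation: `AffordancePlan`, `propose_plan`, `policy_step`, and `run_episode` are hypothetical names, poses are reduced to 3D positions, and both stages are hand-written stand-ins for what would be learned models.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Assumption: a pose is reduced to an (x, y, z) end-effector position.
Pose = Tuple[float, float, float]


@dataclass
class AffordancePlan:
    """Robot poses at key stages of the task (e.g. grasp, then lift)."""
    key_poses: List[Pose]


def propose_plan(instruction: str, image: List[List[float]]) -> AffordancePlan:
    """Stage 1 stand-in: task language + observation -> affordance plan.

    A real system would query a learned model; this stub returns fixed
    poses (and ignores the image) so that the example stays runnable.
    """
    if "pick" in instruction.lower():
        return AffordancePlan(key_poses=[(0.4, 0.0, 0.1), (0.4, 0.0, 0.3)])
    return AffordancePlan(key_poses=[(0.0, 0.0, 0.2)])


def policy_step(current: Pose, plan: AffordancePlan, stage: int,
                step_size: float = 0.5) -> Pose:
    """Stage 2 stand-in: a policy conditioned on the affordance plan.

    Moves a fixed fraction of the way toward the key pose of this stage.
    """
    target = plan.key_poses[stage]
    return tuple(c + step_size * (t - c) for c, t in zip(current, target))


def run_episode(instruction: str, image: List[List[float]], start: Pose,
                steps_per_stage: int = 20):
    """Run both stages: plan once, then follow the plan stage by stage."""
    plan = propose_plan(instruction, image)
    pose = start
    for stage in range(len(plan.key_poses)):
        for _ in range(steps_per_stage):
            pose = policy_step(pose, plan, stage)
    return plan, pose


if __name__ == "__main__":
    plan, final_pose = run_episode("Pick up the cup", [[0.0]], (0.0, 0.0, 0.0))
    print(plan.key_poses, final_pose)
```

The point of the sketch is the interface: the plan is a small, inspectable object sitting between language and low-level control, so it could in principle come from a model or be specified directly by a user.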

RT-Affordance: Affordances are Versatile Intermediate Representations for Robot Manipulation

Affordance-Centric Policy Learning: Sample Efficient and Generalisable Robot Policy Learning using Affordance-Centric Task Frames

Learning Precise Affordances from Egocentric Videos for Robotic Manipulation

RLAfford: End-to-End Affordance Learning for Robotic Manipulation

RAIL: Robot Affordance Imagination with Large Language Models

HRP: Human Affordances for Robotic Pre-Training

Affordance Learning from Play for Sample-Efficient Policy Learning

Information-driven Affordance Discovery for Efficient Robotic Manipulation

Building Affordance Relations for Robotic Agents - A Review

Affordance-based Robot Manipulation with Flow Matching

AffordDP: Generalizable Diffusion Policy with Transferable Affordance

Learning Foresightful Dense Visual Affordance for Deformable Object Manipulation

Utilization of Affordance by Reinforcement Learning Robot

AdaAfford: Learning to Adapt Manipulation Affordance for 3D Articulated Objects via Few-shot Interactions

UniAff: A Unified Representation of Affordances for Tool Usage and Articulation with Vision-Language Models

RT-H: Action Hierarchies Using Language

Robo-ABC: Affordance Generalization Beyond Categories via Semantic Correspondence for Robot Manipulation

EqvAfford: SE(3) Equivariance for Point-Level Affordance Learning

Transferring Foundation Models for Generalizable Robotic Manipulation

Recognizing Object Affordances to Support Scene Reasoning for Manipulation Tasks

PreAfford: Universal Affordance-Based Pre-Grasping for Diverse Objects and Environments