Abstract: Training an agent to achieve particular goals or perform desired behaviors is often accomplished through reinforcement learning, especially in the absence of expert demonstrations. However, supporting novel goals or behaviors through reinforcement learning requires the ad-hoc design of appropriate reward functions, which quickly becomes intractable. To address this challenge, we propose Text-Aware Diffusion for Policy Learning (TADPoLe), which uses a pretrained, frozen text-conditioned diffusion model to compute dense zero-shot reward signals for text-aligned policy learning. We hypothesize that large-scale pretrained generative models encode rich priors that can supervise a policy to behave not only in a text-aligned manner, but also in alignment with a notion of naturalness summarized from internet-scale training data. In our experiments, we demonstrate that TADPoLe is able to learn policies for novel goal-achievement and continuous locomotion behaviors specified by natural language, in both Humanoid and Dog environments. The behaviors are learned zero-shot without ground-truth rewards or expert demonstrations, and are qualitatively more natural according to human evaluation. We further show that TADPoLe performs competitively when applied to robotic manipulation tasks in the Meta-World environment, without having access to any in-domain demonstrations.

What problem does this paper attempt to address?

The core problem this paper addresses is: **How can new goals or behaviors be learned from natural-language descriptions, without expert demonstrations or ground-truth reward functions?** Specifically, the authors propose Text-Aware Diffusion for Policy Learning (TADPoLe), which uses a pretrained, frozen text-conditioned diffusion model to compute dense zero-shot reward signals that guide policy learning.

### Main Problem Background

Traditional reinforcement learning (RL) usually relies on carefully designed reward functions to guide an agent toward specific goals or behaviors. For novel goals or behaviors, manually designing an appropriate reward function is difficult and does not scale. The challenge is greatest when no expert demonstrations are available, because the reward function alone must guide the agent toward complex behaviors.

### TADPoLe Solution

TADPoLe addresses this problem in the following ways:

1. **Using a pretrained text-conditioned diffusion model**:
   - It uses large-scale pretrained generative models (such as Stable Diffusion and AnimateDiff), which have been trained on internet-scale datasets and can generate images or videos aligned with text.
   - These models encode rich priors that can supervise a policy to be not only text-aligned but also consistent with a human notion of naturalness.
2. **Computing dense zero-shot reward signals**:
   - A dense reward signal is computed by comparing the observations rendered by the environment with the text description. It requires no hand-designed ground-truth reward function and no expert demonstrations.
   - The reward signal consists of two parts:
     - **Alignment Reward**: Measures the degree of alignment between the rendered observations and the text description.
     - **Reconstruction Reward**: Measures whether the rendered frames conform to natural motion patterns, via how much the text conditioning improves the model's prediction of the applied noise.
3. **Applicability to multiple tasks and environments**:
   - TADPoLe is evaluated in the Humanoid, Dog, and Meta-World environments, which cover goal-achievement, continuous locomotion, and robotic manipulation tasks.
   - Experiments show that TADPoLe can learn novel policies zero-shot that are flexibly and accurately aligned with natural-language inputs, and that human evaluators judge the resulting behaviors to be more natural.

### Formula Representation

- The alignment reward \( r_{\text{align}} \) and the reconstruction reward \( r_{\text{rec}} \) are computed as follows:

\[ r_{\text{align}}^t = \left\| \hat{\epsilon}_\phi(\tilde{o}_{t + 1}; t_{\text{noise}}, y) - \hat{\epsilon}_\phi(\tilde{o}_{t + 1}; t_{\text{noise}}) \right\|_2^2 \]

\[ r_{\text{rec}}^t = \left\| \hat{\epsilon}_\phi(\tilde{o}_{t + 1}; t_{\text{noise}}) - \epsilon_0 \right\|_2^2 - \left\| \hat{\epsilon}_\phi(\tilde{o}_{t + 1}; t_{\text{noise}}, y) - \epsilon_0 \right\|_2^2 \]

- The final reward signal \( r_t \) is a weighted sum of the alignment and reconstruction rewards, passed through a symlog transformation:

\[ r_t = \text{symlog}(w_1 \cdot r_{\text{align}}^t + w_2 \cdot r_{\text{rec}}^t) \]
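To make the reward formulas concrete, here is a minimal NumPy sketch of how one reward value could be computed from a single rendered observation. It is an illustration under stated assumptions, not the authors' implementation: `toy_noise_predictor` is a hypothetical stand-in for the frozen diffusion model \( \hat{\epsilon}_\phi \), the noised observation \( \tilde{o}_{t+1} \) is assumed to come from the standard DDPM forward process with a cumulative signal level `alpha_bar`, symlog is assumed to be \( \text{sign}(x)\ln(1 + |x|) \), and the weights `w1` and `w2` are illustrative values rather than the paper's settings.

```python
import numpy as np


def symlog(x):
    """Symmetric log, assumed here to be sign(x) * ln(1 + |x|)."""
    return np.sign(x) * np.log1p(np.abs(x))


def toy_noise_predictor(noisy_obs, t_noise, text_embedding=None):
    """Hypothetical stand-in for the frozen diffusion model's noise prediction.

    A real setup would call a pretrained text-conditioned denoiser; this toy
    function exists only so the sketch runs end to end.
    """
    pred = 0.5 * noisy_obs
    if text_embedding is not None:
        pred = pred + 0.1 * text_embedding  # toy effect of text conditioning
    return pred


def tadpole_style_reward(obs, text_embedding, alpha_bar, t_noise, w1, w2, rng,
                         predictor=toy_noise_predictor):
    """Compute (r_t, r_align, r_rec) for one rendered observation."""
    # Sample the source noise eps_0 and corrupt the observation
    # (assumption: standard DDPM forward process).
    eps0 = rng.standard_normal(obs.shape)
    noisy_obs = np.sqrt(alpha_bar) * obs + np.sqrt(1.0 - alpha_bar) * eps0

    # Noise predictions with and without the text condition y.
    eps_cond = predictor(noisy_obs, t_noise, text_embedding)
    eps_uncond = predictor(noisy_obs, t_noise)

    # r_align: squared distance between conditional and unconditional predictions.
    r_align = np.sum((eps_cond - eps_uncond) ** 2)
    # r_rec: unconditional reconstruction error minus conditional reconstruction error.
    r_rec = np.sum((eps_uncond - eps0) ** 2) - np.sum((eps_cond - eps0) ** 2)

    return symlog(w1 * r_align + w2 * r_rec), r_align, r_rec


rng = np.random.default_rng(0)
obs = rng.standard_normal((3, 8, 8))   # stand-in for a rendered frame
text = np.ones_like(obs)               # stand-in for the text condition
reward, r_align, r_rec = tadpole_style_reward(
    obs, text, alpha_bar=0.5, t_noise=500, w1=1.0, w2=1.0, rng=rng
)
print(reward, r_align, r_rec)
```

In this sketch the reconstruction term is positive only when the text-conditioned prediction is closer to the sampled noise than the unconditional one, which matches the sign convention in the formula for \( r_{\text{rec}}^t \).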
