Text-Aware Diffusion for Policy Learning

Calvin Luo, Mandy He, Zilai Zeng, Chen Sun
2024-11-01
Abstract: Training an agent to achieve particular goals or perform desired behaviors is often accomplished through reinforcement learning, especially in the absence of expert demonstrations. However, supporting novel goals or behaviors through reinforcement learning requires the ad-hoc design of appropriate reward functions, which quickly becomes intractable. To address this challenge, we propose Text-Aware Diffusion for Policy Learning (TADPoLe), which uses a pretrained, frozen text-conditioned diffusion model to compute dense zero-shot reward signals for text-aligned policy learning. We hypothesize that large-scale pretrained generative models encode rich priors that can supervise a policy to behave not only in a text-aligned manner, but also in alignment with a notion of naturalness summarized from internet-scale training data. In our experiments, we demonstrate that TADPoLe is able to learn policies for novel goal-achievement and continuous locomotion behaviors specified by natural language, in both Humanoid and Dog environments. The behaviors are learned zero-shot without ground-truth rewards or expert demonstrations, and are qualitatively more natural according to human evaluation. We further show that TADPoLe performs competitively when applied to robotic manipulation tasks in the Meta-World environment, without having access to any in-domain demonstrations.
Machine Learning, Artificial Intelligence, Computer Vision and Pattern Recognition
What problem does this paper attempt to address?
The core problem this paper addresses is: **How can an agent learn new goals or behaviors from natural-language descriptions, without expert demonstrations or ground-truth reward functions?** Specifically, the authors propose Text-Aware Diffusion for Policy Learning (TADPoLe), which uses a pretrained, frozen text-conditioned diffusion model to compute dense zero-shot reward signals that guide policy learning.

### Main Problem Background

Traditional reinforcement learning (RL) usually relies on carefully designed reward functions to guide agents toward specific goals or behaviors. For novel goals or behaviors, however, hand-designing an appropriate reward function is difficult and does not scale. The difficulty is greatest when no expert demonstrations are available and the target behavior is complex.

### TADPoLe Solution

TADPoLe addresses this problem in the following ways:

1. **Utilizing a pretrained text-conditioned diffusion model**:
   - It uses large-scale pretrained generative models (such as Stable Diffusion and AnimateDiff), which have been trained on internet-scale datasets and can generate images or videos aligned with text.
   - These models encode rich priors and can supervise a policy to behave not only in a text-aligned manner but also in a way that humans perceive as natural.
2. **Generating dense zero-shot reward signals**:
   - A dense reward signal is computed at each timestep by scoring the observation rendered by the environment against the text description. It requires no manually designed ground-truth reward function and no expert demonstrations.
   - The reward signal consists of two parts:
     - **Alignment reward**: Measures the degree of alignment between the rendered observation and the text description.
     - **Reconstruction reward**: Measures whether the rendered frames conform to natural motion patterns, by checking how much the text condition improves the model's prediction of the applied noise.
3. **Applicable to multiple tasks and environments**:
   - TADPoLe is evaluated in the Humanoid, Dog, and Meta-World environments, which cover both goal-achievement tasks and continuous locomotion tasks.
   - Experiments show that TADPoLe can learn novel policies zero-shot that are flexibly and accurately aligned with natural-language inputs and that human evaluators judge to be more natural.

### Formula Representation

- The alignment reward \( r_{\text{align}} \) and the reconstruction reward \( r_{\text{rec}} \) are computed as follows:

  \[ r_{\text{align}}^t = \left\| \hat{\epsilon}_\phi(\tilde{o}_{t + 1}; t_{\text{noise}}, y) - \hat{\epsilon}_\phi(\tilde{o}_{t + 1}; t_{\text{noise}}) \right\|_2^2 \]

  \[ r_{\text{rec}}^t = \left\| \hat{\epsilon}_\phi(\tilde{o}_{t + 1}; t_{\text{noise}}) - \epsilon_0 \right\|_2^2 - \left\| \hat{\epsilon}_\phi(\tilde{o}_{t + 1}; t_{\text{noise}}, y) - \epsilon_0 \right\|_2^2 \]

  Here \( \hat{\epsilon}_\phi \) is the diffusion model's noise prediction, \( \tilde{o}_{t + 1} \) is the rendered observation after noise \( \epsilon_0 \) has been added at noise level \( t_{\text{noise}} \), and \( y \) is the text prompt.

- The final reward signal \( r_t \) combines the alignment reward and the reconstruction reward and applies a symlog transformation:

  \[ r_t = \text{symlog}(w_1 \cdot r_{\text{align}}^t+w_2 \cdot r_{\text{rec}}^t) \]
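To make the arithmetic in the formulas above concrete, here is a minimal NumPy sketch that evaluates them on toy vectors. It is an illustration under stated assumptions, not the authors' implementation: the function names `symlog` and `tadpole_reward` are hypothetical, the weights `w1` and `w2` are placeholders rather than values from the paper, symlog is assumed to take its common form \( \text{sign}(x)\ln(1 + |x|) \), and the noise predictions are passed in as plain arrays instead of being produced by a real diffusion model.

```python
import numpy as np


def symlog(x):
    """Assumed symlog definition: sign(x) * ln(1 + |x|)."""
    return np.sign(x) * np.log1p(np.abs(x))


def tadpole_reward(eps_cond, eps_uncond, eps_0, w1=1.0, w2=1.0):
    """Evaluate the reward formulas above on precomputed noise predictions.

    eps_cond   : noise predicted with the text prompt y
    eps_uncond : noise predicted without the text prompt
    eps_0      : noise that was actually added to the rendered observation
    w1, w2     : placeholder weights (hypothetical, not taken from the paper)
    """
    # Alignment: squared L2 gap between conditional and unconditional predictions.
    r_align = np.sum((eps_cond - eps_uncond) ** 2)
    # Reconstruction: how much the text condition reduces the noise-prediction error.
    r_rec = np.sum((eps_uncond - eps_0) ** 2) - np.sum((eps_cond - eps_0) ** 2)
    # Combined reward, squashed with symlog as written in the summary.
    r_total = symlog(w1 * r_align + w2 * r_rec)
    return r_total, r_align, r_rec


# Toy example: the conditional prediction matches the true noise exactly,
# while the unconditional prediction misses it.
eps_0 = np.array([1.0, 0.0])
eps_cond = np.array([1.0, 0.0])
eps_uncond = np.array([0.0, 0.0])
r_total, r_align, r_rec = tadpole_reward(eps_cond, eps_uncond, eps_0)
print(r_total, r_align, r_rec)
```

In this toy case both terms equal 1, so the combined reward is symlog(2) = ln 3. The reconstruction term turns negative whenever the text-conditioned prediction is further from the applied noise than the unconditional one, which is why a sign-preserving squashing function such as symlog is a natural fit.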