Abstract: Using learned reward functions (LRFs) as a means to solve sparse-reward reinforcement learning (RL) tasks has yielded some steady progress in task-complexity through the years. In this work, we question whether today's LRFs are best-suited as a direct replacement for task rewards. Instead, we propose leveraging the capabilities of LRFs as a pretraining signal for RL. Concretely, we propose $\textbf{LA}$nguage Reward $\textbf{M}$odulated $\textbf{P}$retraining (LAMP) which leverages the zero-shot capabilities of Vision-Language Models (VLMs) as a $\textit{pretraining}$ utility for RL as opposed to a downstream task reward. LAMP uses a frozen, pretrained VLM to scalably generate noisy, albeit shaped exploration rewards by computing the contrastive alignment between a highly diverse collection of language instructions and the image observations of an agent in its pretraining environment. LAMP optimizes these rewards in conjunction with standard novelty-seeking exploration rewards with reinforcement learning to acquire a language-conditioned, pretrained policy. Our VLM pretraining approach, which is a departure from previous attempts to use LRFs, can warmstart sample-efficient learning on robot manipulation tasks in RLBench.

What problem does this paper attempt to address?

### What problems does this paper attempt to solve?

This paper addresses two problems in Reinforcement Learning (RL): how to define reward functions and how to pretrain agents. Specifically, it explores how large-scale pretrained Vision-Language Models (VLMs) can generate diverse intrinsic reward signals that guide an RL agent toward effective exploration during the pretraining stage. The aim is to accelerate learning on downstream tasks and reduce their sample complexity.

#### Main problems include:

1. **Definition of the reward function**:
   - Manually designing a reward function requires a great deal of domain knowledge and debugging, and the result is usually hard to interpret because it is full of complex mathematical expressions and constants.
   - Manually designed reward functions are often overly tailored to a specific domain and are difficult to generalize to new agents or environments.
   - Learned Reward Functions (LRFs) can be learned from demonstration data, but they are vulnerable to noise and misspecified rewards, so the resulting policies are not robust enough.
2. **Effectiveness of pretraining**:
   - How to generate behaviors during pretraining that are diverse enough to cover a wide range of tasks, so that downstream tasks start from a better initial policy.
   - How to use the zero-shot generalization ability of VLMs to generate diverse reward signals during pretraining and thereby promote semantically meaningful exploration.

#### Solutions:

The authors propose the **LAnguage Reward Modulated Pretraining (LAMP)** method. Its core ideas are:

- Use a frozen, pretrained VLM to generate intrinsic reward signals from language instructions. These rewards are obtained by computing the contrastive alignment between image observations and language descriptions.
- Combine these intrinsic rewards with novelty-based exploration rewards (such as those of the Plan2Explore algorithm) to optimize the RL agent's policy, so that it explores the environment efficiently and learns diverse skills.
- Fine-tune the pretrained policy on downstream tasks so that it adapts quickly to the requirements of each task.

In this way, LAMP reduces reliance on manually designed reward functions during pretraining and improves the efficiency and generalization of the pretraining stage, enabling RL agents to learn more effectively in complex environments.

### Formula summary

- **Intrinsic reward formula**:
  \[ r_{\text{LAMP}}^i = G_\theta(F_\phi(o_1), F_\phi(o_i), L_\alpha(x)) \]
  where $G_\theta$ is the R3M score predictor, $F_\phi$ is the visual feature encoder, $L_\alpha$ is the language encoder, $o_1$ and $o_i$ are the first and $i$-th image observations, and $x$ is a natural language instruction.
- **Combined reward formula**:
  \[ r_{\text{pre}}^i = \alpha \cdot r_{\text{P2E}}^i + (1 - \alpha) \cdot r_{\text{LAMP}}^i \]
  where $r_{\text{P2E}}^i$ is the novelty-based exploration reward and $\alpha$ is a scalar balancing parameter (distinct from the encoder subscript in $L_\alpha$).

### Conclusion

LAMP uses the flexibility and zero-shot generalization ability of VLMs to generate diverse intrinsic reward signals, which promote efficient exploration during pretraining and reduce the sample complexity of downstream tasks, as reported for robot manipulation tasks in RLBench. This offers a promising approach to pretraining RL agents in complex environments.
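
To make the two reward formulas above concrete, here is a minimal Python sketch under stated assumptions; it is not the authors' implementation. The function names (`lamp_reward`, `pretraining_reward`) and the `toy_*` helpers are hypothetical. In particular, `toy_score` swaps in a plain cosine similarity where the summary names a learned R3M score predictor, and the toy encoders stand in for the frozen VLM's visual and language encoders.

```python
import math


def lamp_reward(score_fn, embed_image, embed_text, first_obs, obs_i, instruction):
    """r_LAMP^i = G(F(o_1), F(o_i), L(x)), with G, F, L passed in as callables."""
    return score_fn(embed_image(first_obs), embed_image(obs_i), embed_text(instruction))


def pretraining_reward(r_p2e, r_lamp, alpha):
    """r_pre^i = alpha * r_P2E^i + (1 - alpha) * r_LAMP^i, applied per step."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if len(r_p2e) != len(r_lamp):
        raise ValueError("reward sequences must have the same length")
    return [alpha * p + (1.0 - alpha) * l for p, l in zip(r_p2e, r_lamp)]


# --- Toy stand-ins below are assumptions for illustration only. ---

def toy_embed_image(obs):
    # Pretend the observation already is its feature vector.
    return [float(v) for v in obs]


def toy_embed_text(instruction):
    # Hypothetical lookup from an instruction to a direction in feature space.
    table = {"move right": [1.0, 0.0], "move up": [0.0, 1.0]}
    return table[instruction]


def toy_score(f_first, f_i, l_x):
    # Cosine similarity between the feature change (f_i - f_first) and the
    # language embedding; a placeholder, NOT the learned R3M score predictor.
    progress = [b - a for a, b in zip(f_first, f_i)]
    dot = sum(p * q for p, q in zip(progress, l_x))
    denom = math.sqrt(sum(p * p for p in progress)) * math.sqrt(sum(q * q for q in l_x))
    return 0.0 if denom == 0.0 else dot / denom


# Score a short, made-up trajectory against one instruction.
observations = [[0.0, 0.0], [1.0, 0.0], [1.0, 1.0]]
r_lamp = [
    lamp_reward(toy_score, toy_embed_image, toy_embed_text,
                observations[0], obs, "move right")
    for obs in observations
]
r_p2e = [0.5, 0.2, 0.1]  # made-up novelty rewards, not from the paper
r_pre = pretraining_reward(r_p2e, r_lamp, alpha=0.5)
print(r_pre)
```

Passing the encoders and the score function as arguments keeps the mixing logic independent of any particular VLM, so a real frozen model could be substituted without touching `pretraining_reward`. Setting `alpha` to 1 recovers pure novelty-seeking exploration, and setting it to 0 uses only the language-conditioned reward.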

Language Reward Modulation for Pretraining Reinforcement Learning

Vision-Language Models are Zero-Shot Reward Models for Reinforcement Learning

Code as Reward: Empowering Reinforcement Learning with VLMs

RL-VLM-F: Reinforcement Learning from Vision Language Foundation Model Feedback

Unleashing the Power of Pre-trained Language Models for Offline Reinforcement Learning

FuRL: Visual-Language Models as Fuzzy Rewards for Reinforcement Learning

Exploring RL-based LLM Training for Formal Language Tasks with Programmed Rewards

From Goal-Conditioned to Language-Conditioned Agents via Vision-Language Models

Words as Beacons: Guiding RL Agents with High-Level Language Prompts

RLHF Deciphered: A Critical Analysis of Reinforcement Learning from Human Feedback for LLMs

LLaRA: Supercharging Robot Learning Data for Vision-Language Policy

Vision-Language Models as a Source of Rewards

Value Augmented Sampling for Language Model Alignment and Personalization

Using Natural Language for Reward Shaping in Reinforcement Learning

Learning Goal-Conditioned Representations for Language Reward Models

Reward Design with Language Models

Countering Reward Over-optimization in LLM with Demonstration-Guided Reinforcement Learning

Accelerating Reinforcement Learning of Robotic Manipulations via Feedback from Large Language Models

REvolve: Reward Evolution with Large Language Models using Human Feedback

Large Language Models as Generalizable Policies for Embodied Tasks