Abstract: Reinforcement learning (RL) requires either manually specifying a reward function, which is often infeasible, or learning a reward model from a large amount of human feedback, which is often very expensive. We study a more sample-efficient alternative: using pretrained vision-language models (VLMs) as zero-shot reward models (RMs) to specify tasks via natural language. We propose a natural and general approach to using VLMs as reward models, which we call VLM-RMs. We use VLM-RMs based on CLIP to train a MuJoCo humanoid to learn complex tasks without a manually specified reward function, such as kneeling, doing the splits, and sitting in a lotus position. For each of these tasks, we only provide a single-sentence text prompt describing the desired task, with minimal prompt engineering. We provide videos of the trained agents at https://sites.google.com/view/vlm-rm. We can improve performance by providing a second "baseline" prompt and projecting out parts of the CLIP embedding space that are irrelevant for distinguishing between goal and baseline. Further, we find a strong scaling effect for VLM-RMs: larger VLMs trained with more compute and data are better reward models. The failure modes of VLM-RMs we encountered are all related to known capability limitations of current VLMs, such as limited spatial reasoning ability or visually unrealistic environments that are far off-distribution for the VLM. We find that VLM-RMs are remarkably robust as long as the VLM is large enough. This suggests that future VLMs will become more and more useful reward models for a wide range of RL applications.
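To make the two ideas in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes the image observation and the text prompts have already been embedded (for example by a CLIP image and text encoder, which is not called here) and are passed in as plain lists of floats. The function names, the mixing parameter `alpha`, and the choice to project onto the line through the baseline and goal embeddings are illustrative assumptions. The abstract only says that directions irrelevant to separating goal from baseline are projected out.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def cosine_reward(state_emb, goal_emb):
    """Zero-shot reward: cosine similarity between the embedded
    observation and the embedded task prompt."""
    return dot(normalize(state_emb), normalize(goal_emb))


def project_onto_line(s, g, b):
    """Project s onto the line through b (baseline) and g (goal)."""
    direction = [gi - bi for gi, bi in zip(g, b)]
    denom = dot(direction, direction)
    if denom == 0.0:
        raise ValueError("goal and baseline embeddings coincide")
    t = dot([si - bi for si, bi in zip(s, b)], direction) / denom
    return [bi + t * di for bi, di in zip(b, direction)]


def regularized_reward(state_emb, goal_emb, baseline_emb, alpha):
    """Reward with a goal-baseline projection (assumed form).

    alpha = 0 keeps the full embedding and reduces to cosine similarity;
    alpha = 1 keeps only the component along the baseline-to-goal line.
    """
    s = normalize(state_emb)
    g = normalize(goal_emb)
    b = normalize(baseline_emb)
    p = project_onto_line(s, g, b)
    mixed = [alpha * pi + (1.0 - alpha) * si for pi, si in zip(p, s)]
    diff = [mi - gi for mi, gi in zip(mixed, g)]
    return 1.0 - 0.5 * dot(diff, diff)


if __name__ == "__main__":
    # Toy 3-dimensional embeddings, purely for illustration.
    goal = [1.0, 0.0, 0.0]
    baseline = [0.0, 1.0, 0.0]
    state = [0.6, 0.0, 0.8]
    print(cosine_reward(state, goal))
    print(regularized_reward(state, goal, baseline, alpha=1.0))
```

With `alpha = 0` the regularized reward equals the plain cosine similarity, because for unit vectors `1 - 0.5 * ||s - g||^2 = s · g`. With `alpha = 1`, two observations that differ only in a direction orthogonal to the baseline-to-goal line receive the same reward, which is the intended effect of projecting out irrelevant parts of the embedding space.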

FuRL: Visual-Language Models as Fuzzy Rewards for Reinforcement Learning

RL-VLM-F: Reinforcement Learning from Vision Language Foundation Model Feedback

Code as Reward: Empowering Reinforcement Learning with VLMs

Vision-Language Models as a Source of Rewards

Vision-Language Models are Zero-Shot Reward Models for Reinforcement Learning

Language Reward Modulation for Pretraining Reinforcement Learning

Fine-Tuning Large Vision-Language Models as Decision-Making Agents via Reinforcement Learning

Fine-Grained Human Feedback Gives Better Rewards for Language Model Training

The Accuracy Paradox in RLHF: When Better Reward Models Don't Yield Better Language Models

Optimizing Language Models with Fair and Stable Reward Composition in Reinforcement Learning

Secrets of RLHF in Large Language Models Part II: Reward Modeling

Text2Reward: Reward Shaping with Language Models for Reinforcement Learning

Beyond Sparse Rewards: Enhancing Reinforcement Learning with Language Model Critique in Text Generation

Words as Beacons: Guiding RL Agents with High-Level Language Prompts

Fine-Tuning Language Models with Reward Learning on Policy

Enhancing Reinforcement Learning with Label-Sensitive Reward for Natural Language Understanding

RoVRM: A Robust Visual Reward Model Optimized via Auxiliary Textual Preference Data

Shattering the Agent-Environment Interface for Fine-Tuning Inclusive Language Models

LongReward: Improving Long-context Large Language Models with AI Feedback

Video-Language Critic: Transferable Reward Functions for Language-Conditioned Robotics

Towards Socially and Morally Aware RL agent: Reward Design With LLM