From Goal-Conditioned to Language-Conditioned Agents via Vision-Language Models

Theo Cachet, Christopher R. Dance, Olivier Sigaud

2024-11-26

Abstract: Vision-language models (VLMs) have tremendous potential for grounding language, and thus enabling language-conditioned agents (LCAs) to perform diverse tasks specified with text. This has motivated the study of LCAs based on reinforcement learning (RL) with rewards given by rendering images of an environment and evaluating those images with VLMs. If single-task RL is employed, such approaches are limited by the cost and time required to train a policy for each new task. Multi-task RL (MTRL) is a natural alternative, but requires a carefully designed corpus of training tasks and does not always generalize reliably to new tasks. Therefore, this paper introduces a novel decomposition of the problem of building an LCA: first find an environment configuration that has a high VLM score for text describing a task; then use a (pretrained) goal-conditioned policy to reach that configuration. We also explore several enhancements to the speed and quality of VLM-based LCAs, notably, the use of distilled models, and the evaluation of configurations from multiple viewpoints to resolve the ambiguities inherent in a single 2D view. We demonstrate our approach on the Humanoid environment, showing that it results in LCAs that outperform MTRL baselines in zero-shot generalization, without requiring any textual task descriptions or other forms of environment-specific annotation during training. Videos and an interactive demo can be found at https://europe.naverlabs.com/text2control

Artificial Intelligence

What problem does this paper attempt to address?

This paper addresses how to build language-conditioned agents (LCAs) that carry out tasks specified by text instructions, without requiring textual task descriptions or other environment-specific annotation during training. Specifically, the authors explore how vision-language models (VLMs) can be used to achieve this. Single-task reinforcement learning (STRL) requires training a separate policy for each new task, which is both time-consuming and costly. Multi-task reinforcement learning (MTRL) can handle multiple tasks, but it requires a carefully designed corpus of training tasks and does not always generalize reliably to new tasks. The paper therefore proposes a new decomposition of the problem of building an LCA: first, find an environment configuration whose rendered image has a high VLM score for the text describing the task; then, use a pretrained goal-conditioned reinforcement learning (GCRL) policy to reach that configuration. This avoids training a separate policy for each new task and, by relying on the pretrained GCRL agent, yields zero-shot generalization to unseen tasks. The paper also explores ways to improve the speed and quality of VLM-based LCAs, including distilled models and the evaluation of configurations from multiple viewpoints to resolve the ambiguities inherent in a single 2D view. Experiments on the Humanoid environment show that the proposed approach outperforms MTRL baselines in zero-shot generalization, and it is also reported to improve computational efficiency.
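The two-stage decomposition described above can be illustrated with a small toy sketch. This is not the authors' implementation: `toy_vlm_score` is a hypothetical stand-in for a real VLM (it compares a 2D projection of a 3D configuration against a `target` that plays the role of the text's meaning), the search is a plain hill-climbing random search chosen for brevity, and `goal_conditioned_policy` is a simple proportional controller standing in for a pretrained policy. All names and parameters are assumptions made for illustration.

```python
import random

# Hypothetical "camera views": each one only sees two coordinates of a 3D
# configuration, mimicking the ambiguity of a single 2D rendering.
VIEWS = [(0, 1), (1, 2), (0, 2)]

def toy_vlm_score(config, view, target):
    """Stand-in for a VLM score in (0, 1]; higher means a better match.

    `target` plays the role of the text description (assumption: a real VLM
    would score a rendered image against the text instead).
    """
    i, j = view
    dist2 = (config[i] - target[i]) ** 2 + (config[j] - target[j]) ** 2
    return 1.0 / (1.0 + dist2)

def multi_view_score(config, target, views=VIEWS):
    """Average the score over several viewpoints to reduce 2D ambiguity."""
    return sum(toy_vlm_score(config, v, target) for v in views) / len(views)

def find_goal_config(target, n_iters=2000, step=0.2, seed=0):
    """Stage 1: hill-climbing random search for a high-scoring configuration."""
    rng = random.Random(seed)
    best = [0.0, 0.0, 0.0]
    best_score = multi_view_score(best, target)
    for _ in range(n_iters):
        cand = [x + rng.gauss(0.0, step) for x in best]
        score = multi_view_score(cand, target)
        if score > best_score:
            best, best_score = cand, score
    return best, best_score

def goal_conditioned_policy(state, goal, gain=0.5):
    """Stage 2 stand-in: action moving the state part of the way to the goal."""
    return [gain * (g - s) for s, g in zip(state, goal)]

def rollout(state, goal, n_steps=30):
    """Apply the goal-conditioned policy repeatedly in a trivial environment."""
    for _ in range(n_steps):
        action = goal_conditioned_policy(state, goal)
        state = [s + a for s, a in zip(state, action)]
    return state

if __name__ == "__main__":
    target = [1.0, -0.5, 0.3]
    goal, score = find_goal_config(target)
    final_state = rollout([0.0, 0.0, 0.0], goal)
    print("goal score:", round(score, 3))
    print("final state:", [round(x, 3) for x in final_state])
```

In this toy setting, a configuration that is wrong only in the coordinate hidden from one view still gets a perfect score from that view, while the multi-view average penalizes it; this is the kind of single-view ambiguity that the multi-viewpoint evaluation is meant to resolve.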

Vision-Language Models are Zero-Shot Reward Models for Reinforcement Learning

Vision-Language Models as a Source of Rewards

RL-VLM-F: Reinforcement Learning from Vision Language Foundation Model Feedback

Fine-Tuning Large Vision-Language Models as Decision-Making Agents via Reinforcement Learning

Vision-Language Models Provide Promptable Representations for Reinforcement Learning

An Introduction to Vision-Language Modeling

Code as Reward: Empowering Reinforcement Learning with VLMs

LLaRA: Supercharging Robot Learning Data for Vision-Language Policy

Large Language Models as Generalizable Policies for Embodied Tasks

Video-Language Critic: Transferable Reward Functions for Language-Conditioned Robotics

OpenVLA: An Open-Source Vision-Language-Action Model

Game On: Towards Language Models as RL Experimenters

VLMimic: Vision Language Models are Visual Imitation Learner for Fine-grained Actions

Boosting Efficient Reinforcement Learning for Vision-and-Language Navigation with Open-Sourced LLM

Distilling Internet-Scale Vision-Language Models into Embodied Agents

Vision-Language Navigation Policy Learning and Adaptation

Reinforcement Learning Friendly Vision-Language Model for Minecraft

CogACT: A Foundational Vision-Language-Action Model for Synergizing Cognition and Action in Robotic Manipulation

Grounding Language with Visual Affordances over Unstructured Data

TinyVLA: Towards Fast, Data-Efficient Vision-Language-Action Models for Robotic Manipulation