From Goal-Conditioned to Language-Conditioned Agents via Vision-Language Models

Theo Cachet, Christopher R. Dance, Olivier Sigaud
2024-11-26
Abstract: Vision-language models (VLMs) have tremendous potential for grounding language, and thus enabling language-conditioned agents (LCAs) to perform diverse tasks specified with text. This has motivated the study of LCAs based on reinforcement learning (RL) with rewards given by rendering images of an environment and evaluating those images with VLMs. If single-task RL is employed, such approaches are limited by the cost and time required to train a policy for each new task. Multi-task RL (MTRL) is a natural alternative, but requires a carefully designed corpus of training tasks and does not always generalize reliably to new tasks. Therefore, this paper introduces a novel decomposition of the problem of building an LCA: first find an environment configuration that has a high VLM score for text describing a task; then use a (pretrained) goal-conditioned policy to reach that configuration. We also explore several enhancements to the speed and quality of VLM-based LCAs, notably, the use of distilled models, and the evaluation of configurations from multiple viewpoints to resolve the ambiguities inherent in a single 2D view. We demonstrate our approach on the Humanoid environment, showing that it results in LCAs that outperform MTRL baselines in zero-shot generalization, without requiring any textual task descriptions or other forms of environment-specific annotation during training. Videos and an interactive demo can be found at https://europe.naverlabs.com/text2control
Artificial Intelligence
What problem does this paper attempt to address?
This paper addresses how to build language-conditioned agents (LCAs) that perform tasks specified by text instructions, without requiring textual task descriptions or other environment-specific annotation during training. Specifically, the authors explore how vision-language models (VLMs) can be used to achieve this. Single-task reinforcement learning (RL) requires training a separate policy for each new task, which is both time-consuming and costly. Multi-task reinforcement learning (MTRL) can handle multiple tasks, but it requires a carefully designed corpus of training tasks and does not always generalize reliably to new tasks. The paper therefore proposes a new decomposition of the problem: first, find an environment configuration whose rendered image receives a high VLM score for the text describing the task; then, use a pretrained goal-conditioned policy to reach that configuration. This avoids training a separate policy for each new task and enables zero-shot generalization to unseen tasks by reusing the pretrained goal-conditioned policy. The paper also explores several ways to improve the speed and quality of VLM-based LCAs, including the use of distilled models and the evaluation of configurations from multiple viewpoints to resolve the ambiguity inherent in a single 2D view. Experimental results show that the proposed method outperforms the MTRL baselines in zero-shot generalization and is also significantly more computationally efficient.
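The two-stage decomposition described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the function names, the toy 2D "configuration", the per-view scorers standing in for a VLM applied to rendered images, the plain random search used to find a high-scoring configuration, and the proportional-step "policy" standing in for a pretrained goal-conditioned policy. Each toy view sees only one coordinate, which mimics how a single 2D view can leave a configuration ambiguous until scores from several views are combined.

```python
import random
from typing import Callable, List, Sequence, Tuple

Config = List[float]
Scorer = Callable[[Config, str], float]

# Hypothetical lookup standing in for what a VLM would infer from the text.
TOY_TARGETS = {"raise both arms": [0.8, -0.3]}


def view_x(config: Config, text: str) -> float:
    """Toy 'camera' that only sees coordinate 0 (higher score is better)."""
    return -(config[0] - TOY_TARGETS[text][0]) ** 2


def view_y(config: Config, text: str) -> float:
    """Toy 'camera' that only sees coordinate 1 (higher score is better)."""
    return -(config[1] - TOY_TARGETS[text][1]) ** 2


def multi_view_score(config: Config, text: str, scorers: Sequence[Scorer]) -> float:
    """Average the per-view scores so no single ambiguous view dominates."""
    return sum(scorer(config, text) for scorer in scorers) / len(scorers)


def find_goal_config(
    text: str,
    scorers: Sequence[Scorer],
    n_candidates: int = 256,
    seed: int = 0,
) -> Tuple[Config, float]:
    """Stage 1: random search for a configuration with a high multi-view score."""
    rng = random.Random(seed)
    best_config: Config = []
    best_score = float("-inf")
    for _ in range(n_candidates):
        candidate = [rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)]
        score = multi_view_score(candidate, text, scorers)
        if score > best_score:
            best_config, best_score = candidate, score
    return best_config, best_score


def toy_policy(state: Config, goal: Config) -> Config:
    """Stand-in goal-conditioned policy: move 20% of the way toward the goal."""
    return [s + 0.2 * (g - s) for s, g in zip(state, goal)]


def reach_goal(state: Config, goal: Config, n_steps: int = 50) -> Config:
    """Stage 2: roll out the goal-conditioned policy toward the chosen goal."""
    for _ in range(n_steps):
        state = toy_policy(state, goal)
    return state


if __name__ == "__main__":
    goal, score = find_goal_config("raise both arms", [view_x, view_y])
    final_state = reach_goal([0.0, 0.0], goal)
    print("goal:", goal, "score:", score, "final state:", final_state)
```

In this sketch the text only influences stage 1; stage 2 never sees the instruction, which is why the goal-conditioned policy can be trained without any textual task descriptions.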