Abstract: Generalization is a pivotal challenge for agents following natural language instructions. To address it, we leverage a vision-language model (VLM) for visual grounding and transfer its vision-language knowledge into reinforcement learning (RL) for object-centric tasks, which enables the agent to generalize zero-shot to unseen objects and instructions. Through visual grounding, we obtain an object-grounded confidence map for the target object indicated in the instruction. Based on this map, we introduce two routes to transfer VLM knowledge into RL. First, we propose an object-grounded intrinsic reward function derived from the confidence map to guide the agent towards the target object more effectively. Second, the confidence map offers a more unified, accessible task representation for the agent's policy than language embeddings do. This enables the agent to process unseen objects and instructions through comprehensible visual confidence maps, facilitating zero-shot object-level generalization. Single-task experiments show that our intrinsic reward significantly improves performance on challenging skill learning. In multi-task experiments, by testing on tasks beyond the training set, we show that an agent given the confidence map as its task representation generalizes better than one conditioned on language. The code is available at https://github.com/PKU-RL/COPL.
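
To make the idea of a reward "derived from the confidence map" concrete, here is a minimal sketch of one possible form. It is not the authors' implementation: the function names `gaussian_kernel` and `intrinsic_reward`, the centre-weighted Gaussian, and the `sigma_frac` parameter are all assumptions chosen for illustration. The sketch assumes the confidence map is a 2D grid of values in [0, 1] aligned with the agent's view, and rewards the agent more when high-confidence pixels lie near the centre of that view.

```python
import math


def gaussian_kernel(height, width, sigma_frac=0.25):
    """Hypothetical centre-weighted kernel with peak value 1.0 at the image centre."""
    cy, cx = (height - 1) / 2.0, (width - 1) / 2.0
    sy, sx = sigma_frac * height, sigma_frac * width
    return [
        [
            math.exp(-(((y - cy) / sy) ** 2 + ((x - cx) / sx) ** 2) / 2.0)
            for x in range(width)
        ]
        for y in range(height)
    ]


def intrinsic_reward(confidence_map, sigma_frac=0.25):
    """Mean of the confidence map weighted by the centre kernel (an assumed form).

    confidence_map: list of rows, each a list of floats in [0, 1].
    """
    height, width = len(confidence_map), len(confidence_map[0])
    kernel = gaussian_kernel(height, width, sigma_frac)
    total = sum(
        confidence_map[y][x] * kernel[y][x]
        for y in range(height)
        for x in range(width)
    )
    return total / (height * width)
```

Under these assumptions, a map whose confident region sits in the middle of the view scores higher than one whose confident region sits in a corner, and an all-zero map scores zero, so the signal grows as the agent brings the target object into view and centres it.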

Pre-trained Word Embeddings for Goal-conditional Transfer Learning in Reinforcement Learning

Learning Efficient Representations for Goal-conditioned Reinforcement Learning Via Tabu Search

Visual Grounding for Object-Level Generalization in Reinforcement Learning

Grounding Language for Transfer in Deep Reinforcement Learning

Deep Reinforcement Learning for Autonomous Driving by Transferring Visual Features

Pre-Training Goal-based Models for Sample-Efficient Reinforcement Learning

Self-Adapting Goals Allow Transfer of Predictive Models to New Tasks

Self-Supervised Reinforcement Learning that Transfers using Random Features

On-Robot Reinforcement Learning with Goal-Contrastive Rewards

Goal exploration augmentation via pre-trained skills for sparse-reward long-horizon goal-conditioned reinforcement learning

From Language to Goals: Inverse Reinforcement Learning for Vision-Based Instruction Following

Guiding Pretraining in Reinforcement Learning with Large Language Models

Jointly Pre-training with Supervised, Autoencoder, and Value Losses for Deep Reinforcement Learning

Learning To Walk With Prior Knowledge

Teacher-student curriculum learning for reinforcement learning

Learning Action-Transferable Policy with Action Embedding

Mutual Information Based Knowledge Transfer Under State-Action Dimension Mismatch

Become a Proficient Player with Limited Data through Watching Pure Videos

Multi-Agent Transfer Learning via Temporal Contrastive Learning

Pre-trained Visual Dynamics Representations for Efficient Policy Learning

State Space Decomposition and Subgoal Creation for Transfer in Deep Reinforcement Learning