Guide Your Agent with Adaptive Multimodal Rewards

Changyeon Kim, Younggyo Seo, Hao Liu, Lisa Lee, Jinwoo Shin, Honglak Lee, Kimin Lee

2023-10-25

Abstract: Developing an agent capable of adapting to unseen environments remains a difficult challenge in imitation learning. This work presents Adaptive Return-conditioned Policy (ARP), an efficient framework designed to enhance the agent's generalization ability using natural language task descriptions and pre-trained multimodal encoders. Our key idea is to calculate a similarity between visual observations and natural language instructions in the pre-trained multimodal embedding space (such as CLIP) and use it as a reward signal. We then train a return-conditioned policy using expert demonstrations labeled with multimodal rewards. Because the multimodal rewards provide adaptive signals at each timestep, ARP effectively mitigates goal misgeneralization. This results in superior generalization performance compared to existing text-conditioned policies, even when faced with unseen text instructions. To improve the quality of rewards, we also introduce a fine-tuning method for pre-trained multimodal encoders, further enhancing performance. Video demonstrations and source code are available on the project website: https://sites.google.com/view/2023arp

Machine Learning, Artificial Intelligence, Computer Vision and Pattern Recognition, Robotics

What problem does this paper attempt to address?

The paper addresses the challenge of building imitation learning (IL) agents that can adapt to unseen environments. Existing IL methods often overfit to the training data, so the learned policy fails to generalize and can pursue the wrong goal when the environment changes. To overcome this, the paper proposes the Adaptive Return-conditioned Policy (ARP), a framework that improves generalization by combining natural language task descriptions with pre-trained multimodal encoders. The core idea is to compute the similarity between visual observations and the natural language instruction in a pre-trained multimodal embedding space (such as CLIP), use that similarity as a reward signal, and then train a return-conditioned policy on expert demonstrations labeled with these multimodal rewards. Because the reward is recomputed at every timestep, it gives the policy an adaptive signal that mitigates goal misgeneralization and improves performance on unseen text instructions. The paper also introduces a fine-tuning scheme for the pre-trained multimodal encoder that improves reward quality and, in turn, policy performance. According to the reported experiments, ARP steers the agent away from undesired goals and can follow unseen text instructions in test environments, including instructions that refer to target objects with new colors and shapes.
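The reward labeling described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the embeddings are hand-written toy vectors standing in for the outputs of a multimodal encoder such as CLIP, the function names (`cosine_similarity`, `multimodal_rewards`, `returns_to_go`) are hypothetical, and the use of cosine similarity and an undiscounted sum of future rewards as the return are assumptions that the paper may define differently.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def multimodal_rewards(
    frame_embeddings: Sequence[Sequence[float]],
    text_embedding: Sequence[float],
) -> List[float]:
    """Per-timestep reward: similarity of each frame embedding to the instruction.

    Assumption: both embeddings live in the same space, as they would when
    produced by the image and text towers of one multimodal encoder.
    """
    return [cosine_similarity(frame, text_embedding) for frame in frame_embeddings]


def returns_to_go(rewards: Sequence[float]) -> List[float]:
    """Undiscounted sum of rewards from each timestep to the end of the episode.

    Assumption: this is one common way to build the conditioning signal for a
    return-conditioned policy; discounting is omitted for simplicity.
    """
    out = [0.0] * len(rewards)
    running = 0.0
    for t in range(len(rewards) - 1, -1, -1):
        running += rewards[t]
        out[t] = running
    return out


# Toy demonstration: three frames that move progressively closer to the
# instruction embedding. Real embeddings would come from the encoder.
instruction = [1.0, 0.0]
frames = [[0.0, 1.0], [1.0, 1.0], [1.0, 0.0]]

rewards = multimodal_rewards(frames, instruction)
targets = returns_to_go(rewards)
for t, (r, g) in enumerate(zip(rewards, targets)):
    print(f"t={t} reward={r:.3f} return_to_go={g:.3f}")
```

In a training pipeline, each expert demonstration would be labeled this way, and the policy would receive the return-to-go alongside the observation as input when predicting the expert action.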

Adapt2Reward: Adapting Video-Language Models to Generalizable Robotic Rewards via Failure Prompts

Video-Language Critic: Transferable Reward Functions for Language-Conditioned Robotics

GAILPG: Multi-Agent Policy Gradient with Generative Adversarial Imitation Learning

Retroformer: Retrospective Large Language Agents with Policy Gradient Optimization

Efficient Language-instructed Skill Acquisition via Reward-Policy Co-Evolution

Efficient Policy Adaptation with Contrastive Prompt Ensemble for Embodied Agents

Agent-Time Attention for Sparse Rewards Multi-Agent Reinforcement Learning

CLIP-RT: Learning Language-Conditioned Robotic Policies from Natural Language Supervision

Learning Efficient Multi-Agent Cooperative Visual Exploration

GR-MG: Leveraging Partially Annotated Data via Multi-Modal Goal Conditioned Policy

Multigoal Visual Navigation With Collision Avoidance via Deep Reinforcement Learning

Cooperative Policy Learning with Pre-trained Heterogeneous Observation Representations

RoboMP$^2$: A Robotic Multimodal Perception-Planning Framework with Multimodal Large Language Models

From Goal-Conditioned to Language-Conditioned Agents via Vision-Language Models

Multi-Agent Collaborative Target Search Based on the Multi-Agent Deep Deterministic Policy Gradient with Emotional Intrinsic Motivation

Exploring into the Unseen: Enhancing Language-Conditioned Policy Generalization with Behavioral Information

Text-Aware Diffusion for Policy Learning

Affordance-Guided Reinforcement Learning via Visual Prompting

Adaptive Language-Guided Abstraction from Contrastive Explanations

Language to Rewards for Robotic Skill Synthesis