Guide Your Agent with Adaptive Multimodal Rewards

Changyeon Kim, Younggyo Seo, Hao Liu, Lisa Lee, Jinwoo Shin, Honglak Lee, Kimin Lee
2023-10-25
Abstract: Developing an agent capable of adapting to unseen environments remains a difficult challenge in imitation learning. This work presents Adaptive Return-conditioned Policy (ARP), an efficient framework designed to enhance the agent's generalization ability using natural language task descriptions and pre-trained multimodal encoders. Our key idea is to calculate the similarity between visual observations and natural language instructions in a pre-trained multimodal embedding space (such as CLIP) and use it as a reward signal. We then train a return-conditioned policy using expert demonstrations labeled with multimodal rewards. Because the multimodal rewards provide adaptive signals at each timestep, ARP effectively mitigates goal misgeneralization. This results in superior generalization performance compared to existing text-conditioned policies, even when faced with unseen text instructions. To improve the quality of the rewards, we also introduce a fine-tuning method for pre-trained multimodal encoders, further enhancing performance. Video demonstrations and source code are available on the project website: https://sites.google.com/view/2023arp
Machine Learning, Artificial Intelligence, Computer Vision and Pattern Recognition, Robotics
What problem does this paper attempt to address?
The paper addresses the difficulty of building imitation learning (IL) agents that adapt to unseen environments. Existing IL methods often overfit to their training data and fail to generalize, producing meaningless behavior in new environments. To overcome this, the paper proposes Adaptive Return-conditioned Policy (ARP), a framework that improves generalization using natural language task descriptions and pre-trained multimodal encoders. The core idea is to compute the similarity between visual observations and natural language instructions in a pre-trained multimodal embedding space (such as CLIP), use that similarity as a reward signal, and train a return-conditioned policy on expert demonstrations labeled with these multimodal rewards. Because the rewards provide an adaptive signal at each timestep, ARP mitigates goal misgeneralization and generalizes better to unseen text instructions. The paper also introduces a fine-tuning scheme for the pre-trained multimodal encoders that improves reward quality and further enhances performance. Experimental results show that ARP steers the agent away from unintended goals and can execute unseen text instructions in test environments, including instructions that refer to target objects with new colors and shapes.
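The reward labeling described above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the image and text embeddings have already been produced by some pre-trained multimodal encoder (random vectors stand in for them here), and it assumes an undiscounted return-to-go, which the summary does not specify. The function names `multimodal_rewards` and `returns_to_go` are hypothetical.

```python
import numpy as np


def multimodal_rewards(image_embs, text_emb, eps=1e-8):
    """Per-timestep reward: cosine similarity between each frame embedding
    and the instruction embedding (both assumed precomputed)."""
    image_embs = np.asarray(image_embs, dtype=np.float64)
    text_emb = np.asarray(text_emb, dtype=np.float64)
    # Normalize each frame embedding and the text embedding to unit length.
    img_unit = image_embs / (np.linalg.norm(image_embs, axis=1, keepdims=True) + eps)
    txt_unit = text_emb / (np.linalg.norm(text_emb) + eps)
    return img_unit @ txt_unit


def returns_to_go(rewards):
    """Undiscounted sum of current and future rewards at each timestep
    (an assumption; the source does not state how returns are computed)."""
    rewards = np.asarray(rewards, dtype=np.float64)
    return np.cumsum(rewards[::-1])[::-1]


# Toy demonstration: 5 frames with 16-dimensional stand-in embeddings.
rng = np.random.default_rng(0)
image_embs = rng.normal(size=(5, 16))  # placeholder for encoded observations
text_emb = rng.normal(size=16)         # placeholder for the encoded instruction

rewards = multimodal_rewards(image_embs, text_emb)
rtg = returns_to_go(rewards)
print("rewards:", rewards)
print("returns-to-go:", rtg)
```

In a return-conditioned setup, each demonstration step would then be stored together with its return-to-go, so that the policy can be trained to predict the expert action given the observation and the return it should achieve.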