Language-Model-Assisted Bi-Level Programming for Reward Learning from Internet Videos

Harsh Mahesheka, Zhixian Xie, Zhaoran Wang, Wanxin Jin
2024-10-12
Abstract: Learning from Demonstrations, particularly from biological experts like humans and animals, often encounters significant data acquisition challenges. While recent approaches leverage internet videos for learning, they require complex, task-specific pipelines to extract and retarget motion data for the agent. In this work, we introduce a language-model-assisted bi-level programming framework that enables a reinforcement learning agent to directly learn its reward from internet videos, bypassing dedicated data preparation. The framework includes two levels: an upper level where a vision-language model (VLM) provides feedback by comparing the learner's behavior with expert videos, and a lower level where a large language model (LLM) translates this feedback into reward updates. The VLM and LLM collaborate within this bi-level framework, using a "chain rule" approach to derive a valid search direction for reward learning. We validate the method for reward learning from YouTube videos, and the results have shown that the proposed method enables efficient reward design from expert videos of biological agents for complex behavior synthesis.
Robotics, Artificial Intelligence
What problem does this paper attempt to address?
This paper addresses how a reinforcement learning agent can learn its reward function directly from internet videos, without dedicated data preparation. Existing Learning from Demonstration (LfD) methods struggle to acquire expert data, especially when the demonstrators are biological experts such as humans and animals. Some methods do learn from internet videos, but they usually require complex, task-specific pipelines to extract motion data and retarget it to the agent. The paper proposes a language-model-assisted bi-level programming framework that lets the agent learn its reward directly from such videos, bypassing those data-preparation steps. The framework has two levels: at the upper level, a vision-language model (VLM) provides feedback by comparing the learner's behavior with the expert videos; at the lower level, a large language model (LLM) translates this feedback into reward updates. The VLM and LLM collaborate within this bi-level framework, using a "chain rule" approach to derive a valid search direction for reward learning. The authors validate the method on reward learning from YouTube videos and report that it enables efficient reward design from expert videos of biological agents for complex behavior synthesis.
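The two-level loop described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `bilevel_reward_learning`, the three callables, the `"MATCH"` stopping token, and the toy stand-ins are all hypothetical and chosen only for illustration. In practice the callables would wrap an RL trainer, a VLM, and an LLM; here they are plain string functions so the control flow can be run as-is.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class RewardLearningResult:
    reward: str                  # final reward description (e.g. reward code as text)
    feedback_history: List[str]  # VLM feedback collected at each iteration


def bilevel_reward_learning(
    initial_reward: str,
    train_policy: Callable[[str], str],       # reward -> rollout (hypothetical interface)
    vlm_feedback: Callable[[str, str], str],  # (rollout, expert video) -> text feedback
    llm_update: Callable[[str, str], str],    # (reward, feedback) -> updated reward
    expert_video: str,
    max_iters: int = 5,
    done_token: str = "MATCH",                # assumed stopping signal, not from the paper
) -> RewardLearningResult:
    reward = initial_reward
    history: List[str] = []
    for _ in range(max_iters):
        # Train (or fine-tune) the agent under the current reward.
        rollout = train_policy(reward)
        # Upper level: the VLM compares the learner's behavior with the expert video.
        feedback = vlm_feedback(rollout, expert_video)
        history.append(feedback)
        if feedback == done_token:
            break
        # Lower level: the LLM translates the feedback into a reward update.
        reward = llm_update(reward, feedback)
    return RewardLearningResult(reward, history)


# Toy stand-ins so the sketch runs without any model or simulator.
def toy_train(reward: str) -> str:
    return f"rollout under [{reward}]"


def toy_vlm(rollout: str, expert_video: str) -> str:
    return "MATCH" if "lift knees" in rollout else "knees too low"


def toy_llm(reward: str, feedback: str) -> str:
    return reward + " + lift knees"


result = bilevel_reward_learning(
    "forward velocity", toy_train, toy_vlm, toy_llm, "expert.mp4"
)
print(result.reward)
print(result.feedback_history)
```

Passing the three components in as callables keeps the loop independent of any particular model or simulator, which is a design choice of this sketch rather than something stated in the paper.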