Harshit Sikchi, Siddhant Agarwal, Pranaya Jajoo, Samyak Parajuli, Caleb Chuck, Max Rudolph, Peter Stone, Amy Zhang, Scott Niekum
Abstract: Rewards remain an uninterpretable way to specify tasks for Reinforcement Learning, as humans are often unable to predict the optimal behavior of any given reward function, leading to poor reward design and reward hacking. Language presents an appealing way to communicate intent to agents and bypass reward design, but prior efforts to do so have been limited by costly and unscalable labeling efforts. In this work, we propose a method for a completely unsupervised alternative to grounding language instructions in a zero-shot manner to obtain policies. We present a solution that takes the form of imagine, project, and imitate: The agent imagines the observation sequence corresponding to the language description of a task, projects the imagined sequence to our target domain, and grounds it to a policy. Video-language models allow us to imagine task descriptions that leverage knowledge of tasks learned from internet-scale video-text mappings. The challenge remains to ground these generations to a policy. In this work, we show that we can achieve a zero-shot language-to-behavior policy by first grounding the imagined sequences in real observations of an unsupervised RL agent and using a closed-form solution to imitation learning that allows the RL agent to mimic the grounded observations. Our method, RLZero, is the first to our knowledge to show zero-shot language to behavior generation abilities without any supervision on a variety of tasks on simulated domains. We further show that RLZero can also generate policies zero-shot from cross-embodied videos such as those scraped from YouTube.
What problem does this paper attempt to address?
The paper addresses the challenge of task specification in Reinforcement Learning (RL): how to turn language instructions into behavioral policies without supervision. Traditional RL methods usually require experts to hand-design reward functions to specify tasks, which limits the scalability of RL agents and makes these agents hard to understand for users unfamiliar with reward design. Moreover, even experts often cannot predict the optimal behavior a given reward function induces, so simple reward functions are easily "hacked" (i.e., they produce behavior that does not match human intent).
To solve these problems, the paper proposes a completely unsupervised method that generates behavioral policies from language instructions zero-shot. The method, named RLZero, consists of three steps: imagine, project, and imitate. First, given a language description of the task, a video-language model imagines the corresponding sequence of observations. Next, the imagined sequence is projected onto the target domain, that is, grounded in real observations collected by an unsupervised RL agent. Finally, a closed-form solution to imitation learning lets the RL agent mimic the grounded observations, completing the path from language instruction to behavioral policy.
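The three steps above can be sketched in code. This is a minimal illustration under stated assumptions, not the authors' implementation: `imagine` is a stub standing in for a video-language model, `project` grounds each imagined frame by cosine-similarity nearest-neighbor retrieval (one plausible way to do the projection), and `imitate` averages a hypothetical state embedding `backward_map` to obtain a task vector in closed form. All function names, dimensions, and the retrieval and averaging choices are assumptions made for illustration.

```python
import numpy as np


def imagine(instruction, num_frames=4, dim=8, seed=0):
    """Stub for a video-language model (assumption).

    Returns one embedding per imagined frame. The instruction text is
    ignored here; a real model would condition on it.
    """
    rng = np.random.default_rng(seed)
    return rng.normal(size=(num_frames, dim))


def project(imagined, dataset):
    """Ground imagined frames in real observations.

    For each imagined frame, return the index of the dataset observation
    with the highest cosine similarity (illustrative choice).
    """
    a = imagined / np.linalg.norm(imagined, axis=1, keepdims=True)
    b = dataset / np.linalg.norm(dataset, axis=1, keepdims=True)
    return np.argmax(a @ b.T, axis=1)


def imitate(grounded, backward_map):
    """Closed-form task vector (assumption): mean embedding of the
    grounded observations under a hypothetical state-embedding map."""
    return backward_map(grounded).mean(axis=0)


rng = np.random.default_rng(1)
dataset = rng.normal(size=(100, 8))   # observations from an unsupervised agent
frames = imagine("do a cartwheel")    # imagine
idx = project(frames, dataset)        # project
grounded = dataset[idx]
W = rng.normal(size=(8, 3))           # hypothetical linear embedding
z = imitate(grounded, lambda s: s @ W)  # imitate
```

In this sketch, a pretrained policy conditioned on `z` would then be executed in the environment; that policy is outside the scope of the example.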
The main contribution is a framework for transforming language instructions into behavioral policies without any supervision, validated on a variety of tasks in simulated domains; to the authors' knowledge, RLZero is the first method to show this zero-shot capability. In addition, RLZero can generate policies zero-shot from cross-embodiment videos, such as those scraped from YouTube.