Self-Supervised Reinforcement Learning that Transfers using Random Features

Boyuan Chen, Chuning Zhu, Pulkit Agrawal, Kaiqing Zhang, Abhishek Gupta
DOI: https://doi.org/10.48550/arXiv.2305.17250
2023-05-27
Abstract: Model-free reinforcement learning algorithms have exhibited great potential in solving single-task sequential decision-making problems with high-dimensional observations and long horizons, but are known to be hard to generalize across tasks. Model-based RL, on the other hand, learns task-agnostic models of the world that naturally enable transfer across different reward functions, but struggles to scale to complex environments due to the compounding error. To get the best of both worlds, we propose a self-supervised reinforcement learning method that enables the transfer of behaviors across tasks with different rewards, while circumventing the challenges of model-based RL. In particular, we show self-supervised pre-training of model-free reinforcement learning with a number of random features as rewards allows implicit modeling of long-horizon environment dynamics. Then, planning techniques like model-predictive control using these implicit models enable fast adaptation to problems with new reward functions. Our method is self-supervised in that it can be trained on offline datasets without reward labels, but can then be quickly deployed on new tasks. We validate that our proposed method enables transfer across tasks on a variety of manipulation and locomotion domains in simulation, opening the door to generalist decision-making agents.
Machine Learning, Artificial Intelligence
What problem does this paper attempt to address?
This paper addresses transfer learning across tasks that share the same environment dynamics but have different reward functions. Specifically, the authors aim to build general-purpose agents that perform well on multiple tasks with high-dimensional observations and long horizons. Model-free RL algorithms perform well on a single task but generalize poorly across tasks, while model-based RL transfers more readily across tasks but is hard to scale to complex environments because of compounding errors.

To address this, the paper proposes a self-supervised reinforcement learning method, Random Features for Model-Free Planning (RaMP). The method implicitly models the long-horizon dynamics of the environment by using a large number of random features as rewards, which sidesteps the compounding-error problem of model-based RL and allows quick adaptation to new reward functions at test time. Agents can be pre-trained on offline datasets without explicit reward labels and then quickly adjust their behavior when they encounter a new task.

### Method Overview

1. **Offline Training Phase**: Use the collected exploration trajectories (without reward labels) to train Q-basis functions by accumulating a set of random features along each trajectory. Each Q-basis function is the Q-value of one random feature treated as a reward, so these functions capture the dynamics of the environment without depending on any specific task or reward.
2. **Online Rapid Adaptation Phase**: When a new task arrives, collect a small number of interactions with the environment and use linear regression to express the new reward function as a combination of the random features; the same combination of Q-basis functions then gives the Q-function for the new reward. Then use model predictive control (MPC) to plan with the inferred Q-function and find the best action sequence.

### Theoretical Basis

- **Theorem 3.1**: Under standard coverage and sampling assumptions, for any reward function R and policy π, an appropriate choice of the number of random features K, the trajectory length H, and the number of samples M in the dataset D bounds the gap between the estimated Q-function and the true Q-function by O(ϵ), where ϵ is a tunable small tolerance.
- **Theorem 3.2**: In an environment with deterministic transitions, for a given reward function R, the policy π′_H obtained by multi-step policy improvement over the policy class Π has a value function V^{π′_H} that is superior to that of all possible single-step or multi-step action sequence policies.

In this way, RaMP can handle problems with high-dimensional observations and long horizons while transferring across tasks without repeating a large amount of computation. Experimental results show that RaMP performs well on robotic manipulation and locomotion tasks in simulated environments.
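The two phases in the method overview can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the random cosine feature map `phi`, the toy 1-D dynamics `step`, the constants `K`, `H`, `GAMMA`, and the random-shooting planner `plan` are all assumptions chosen for brevity. In particular, the paper trains the Q-basis from offline trajectories, whereas this sketch computes the accumulated features by rolling out the toy dynamics as a stand-in. What it does show is the linearity the method relies on: if a reward is a linear combination of the random features, the same weights applied to the Q-basis give the discounted return.

```python
import numpy as np

K = 256       # number of random features (illustrative value)
H = 5         # length of each open-loop action sequence (illustrative)
GAMMA = 0.9   # discount factor (illustrative)

_rng = np.random.default_rng(0)
W = _rng.normal(size=(K, 2))                  # random projection, fixed after sampling
B = _rng.uniform(0.0, 2.0 * np.pi, size=K)    # random phases


def phi(s, a):
    """Hypothetical random feature map: (state, action) -> R^K."""
    return np.cos(W @ np.array([s, a]) + B)


def step(s, a):
    """Toy deterministic 1-D dynamics (an assumption, not from the paper)."""
    return float(np.clip(s + 0.1 * a, -1.0, 1.0))


def q_basis(s, actions):
    """Discounted sum of random features along an open-loop action sequence.

    Stand-in for the learned Q-basis: here it is computed by rollout.
    """
    psi = np.zeros(K)
    for h, a in enumerate(actions):
        psi += GAMMA ** h * phi(s, a)
        s = step(s, a)
    return psi


def fit_reward_weights(states, actions, rewards):
    """Least-squares fit of w such that reward(s, a) ~= phi(s, a) @ w."""
    Phi = np.stack([phi(s, a) for s, a in zip(states, actions)])
    w, *_ = np.linalg.lstsq(Phi, np.asarray(rewards, dtype=float), rcond=None)
    return w


def plan(s, w, rng, n_candidates=200):
    """Random-shooting planner: score sampled sequences with q_basis @ w."""
    candidates = rng.uniform(-1.0, 1.0, size=(n_candidates, H))
    scores = np.array([q_basis(s, c) @ w for c in candidates])
    return candidates[np.argmax(scores)]


# Usage: adapt to a hypothetical new reward from a few labeled transitions.
def new_reward(s, a):
    return -(s - 0.5) ** 2


data_rng = np.random.default_rng(1)
S = data_rng.uniform(-1.0, 1.0, size=500)
A = data_rng.uniform(-1.0, 1.0, size=500)
w_hat = fit_reward_weights(S, A, [new_reward(s, a) for s, a in zip(S, A)])
best_actions = plan(0.0, w_hat, data_rng)
```

In an MPC loop, only the first action of `best_actions` would be executed before replanning from the next state. Note that the Q-basis is computed once and reused for every new reward; only the weight vector `w_hat` changes per task.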