Augmenting Unsupervised Reinforcement Learning with Self-Reference

Andrew Zhao, Erle Zhu, Rui Lu, Matthieu Lin, Yong-Jin Liu, Gao Huang
2023-11-16
Abstract: Humans possess the ability to draw on past experiences explicitly when learning new tasks and applying them accordingly. We believe this capacity for self-referencing is especially advantageous for reinforcement learning agents in the unsupervised pretrain-then-finetune setting. During pretraining, an agent's past experiences can be explicitly utilized to mitigate the nonstationarity of intrinsic rewards. In the finetuning phase, referencing historical trajectories prevents the unlearning of valuable exploratory behaviors. Motivated by these benefits, we propose the Self-Reference (SR) approach, an add-on module explicitly designed to leverage historical information and enhance agent performance within the pretrain-finetune paradigm. Our approach achieves state-of-the-art results in terms of Interquartile Mean (IQM) performance and Optimality Gap reduction on the Unsupervised Reinforcement Learning Benchmark for model-free methods, recording an 86% IQM and a 16% Optimality Gap. Additionally, it improves current algorithms by up to 17% IQM and reduces the Optimality Gap by 31%. Beyond performance enhancement, the Self-Reference add-on also increases sample efficiency, a crucial attribute for real-world applications.
Machine Learning, Artificial Intelligence, Robotics
What problem does this paper attempt to address?
### Problems Addressed by the Paper

The paper primarily addresses two core issues in Unsupervised Reinforcement Learning (URL):

1. **Non-stationarity Problem**: During the pre-training phase, the reward function implicitly depends on historical state transitions, causing the Markov Decision Process (MDP) to become non-stationary. Existing algorithms typically do not model this change, leading to unstable and inefficient learning.
2. **Forgetting Exploration Behavior Problem**: In the fine-tuning phase, the traditional pre-training-fine-tuning paradigm may lead to the forgetting of exploration behaviors from the pre-training policy, which is detrimental in scenarios requiring efficient adaptation to downstream tasks.

To address these issues, the authors propose a method called Self-Reference (SR). SR is an auxiliary module designed to enhance the performance and efficiency of both the pre-training and fine-tuning phases by leveraging historical information. Specifically, SR achieves these goals through the following means:

- **Explicit Utilization of Historical Information**: At each decision point, the agent is presented with historical experiences, enabling it to create statistical summaries of visited states and explicitly model changes in rewards.
- **Preventing Forgetting of Exploration Behaviors**: By presenting old experiences, SR reduces the forgetting of pre-training exploration behaviors during the early stages of fine-tuning.

Experimental results show that the SR method significantly improves performance across multiple benchmarks and enhances sample efficiency. Additionally, the authors propose an optional distillation phase to eliminate the extra computational overhead during subsequent deployment stages.
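
The non-stationarity issue and the idea of showing the agent its own history can be illustrated with a toy example. This is only a sketch under stated assumptions, not the paper's implementation. The count-based bonus `1 / sqrt(N(s))` is a stand-in for whatever intrinsic reward the pre-training algorithm uses, and `CountBonus` and `augment_with_history` are hypothetical names introduced here. The first part shows that the same state earns a different reward depending on how often it was visited before. The second part shows one simple way a historical summary (here, a visit count) could be attached to the observation so that the reward becomes predictable from the agent's input.

```python
import math
from collections import Counter


class CountBonus:
    """Toy count-based intrinsic reward (an assumption, not the paper's reward).

    r(s) = 1 / sqrt(N(s)), where N(s) counts visits including the current one.
    """

    def __init__(self):
        self.counts = Counter()

    def reward(self, state):
        # The reward depends on the visitation history, not only on `state`,
        # which is what makes the reward non-stationary from the agent's view.
        self.counts[state] += 1
        return 1.0 / math.sqrt(self.counts[state])


def augment_with_history(state, counts):
    """Hypothetical self-reference step: pair the state with a summary of
    past experience (here, just its visit count)."""
    return (state, counts[state])


bonus = CountBonus()

# Visiting the same state repeatedly yields a shrinking reward.
rewards = [bonus.reward("s0") for _ in range(4)]

# With the history summary attached, the input reflects the visitation history.
augmented = augment_with_history("s0", bonus.counts)
```

In this toy setting the plain observation `"s0"` maps to four different rewards, whereas the augmented observation carries the count that determines the reward. The summary above suggests that SR gives the agent comparable information by presenting historical experiences at each decision point; how those experiences are selected and encoded is not specified here.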