Augmenting Unsupervised Reinforcement Learning with Self-Reference

Andrew Zhao, Erle Zhu, Rui Lu, Matthieu Lin, Yong-Jin Liu, Gao Huang

2023-11-16

Abstract: Humans possess the ability to draw on past experiences explicitly when learning new tasks and applying them accordingly. We believe this capacity for self-referencing is especially advantageous for reinforcement learning agents in the unsupervised pretrain-then-finetune setting. During pretraining, an agent's past experiences can be explicitly utilized to mitigate the nonstationarity of intrinsic rewards. In the finetuning phase, referencing historical trajectories prevents the unlearning of valuable exploratory behaviors. Motivated by these benefits, we propose the Self-Reference (SR) approach, an add-on module explicitly designed to leverage historical information and enhance agent performance within the pretrain-finetune paradigm. Our approach achieves state-of-the-art results in terms of Interquartile Mean (IQM) performance and Optimality Gap reduction on the Unsupervised Reinforcement Learning Benchmark for model-free methods, recording an 86% IQM and a 16% Optimality Gap. Additionally, it improves current algorithms by up to 17% IQM and reduces the Optimality Gap by 31%. Beyond performance enhancement, the Self-Reference add-on also increases sample efficiency, a crucial attribute for real-world applications.
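For readers unfamiliar with the two metrics named in the abstract, the sketch below shows how they are commonly computed from normalized scores. It assumes scores are normalized so that 1.0 corresponds to the target performance, and the sample values are hypothetical illustrations, not results from the paper. The function names `interquartile_mean` and `optimality_gap` are chosen here for illustration.

```python
def interquartile_mean(scores):
    """Mean of the middle 50% of scores (bottom and top 25% discarded)."""
    ordered = sorted(scores)
    cut = len(ordered) // 4  # number of scores dropped from each end
    middle = ordered[cut:len(ordered) - cut]
    return sum(middle) / len(middle)


def optimality_gap(scores, target=1.0):
    """Average shortfall below the target; scores above the target add nothing."""
    return sum(max(target - s, 0.0) for s in scores) / len(scores)


# Hypothetical normalized scores across runs and tasks (not from the paper).
example_scores = [0.1, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.2]
iqm = interquartile_mean(example_scores)
gap = optimality_gap(example_scores)
print(f"IQM: {iqm:.2f}, Optimality Gap: {gap:.2f}")
```

A higher IQM and a lower Optimality Gap are both better, which is why the abstract reports an increase in one and a reduction in the other.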

Machine Learning, Artificial Intelligence, Robotics

What problem does this paper attempt to address?

### Problems Addressed by the Paper

The paper primarily addresses two core issues in Unsupervised Reinforcement Learning (URL):

1. **Non-stationarity Problem**: During the pre-training phase, the reward function implicitly depends on historical state transitions, causing the Markov Decision Process (MDP) to become non-stationary. Existing algorithms typically do not model this change, leading to unstable and inefficient learning.
2. **Forgetting Exploration Behavior Problem**: In the fine-tuning phase, the traditional pre-training-fine-tuning paradigm may lead to the forgetting of exploration behaviors from the pre-training policy, which is detrimental in scenarios requiring efficient adaptation to downstream tasks.

To address these issues, the authors propose a method called Self-Reference (SR). SR is an auxiliary module designed to enhance the performance and efficiency of both the pre-training and fine-tuning phases by leveraging historical information. Specifically, SR achieves these goals through the following means:

- **Explicit Utilization of Historical Information**: At each decision point, the agent is presented with historical experiences, enabling it to create statistical summaries of visited states and explicitly model changes in rewards.
- **Preventing Forgetting of Exploration Behaviors**: By presenting old experiences, SR reduces the forgetting of pre-training exploration behaviors during the early stages of fine-tuning.

Experimental results show that the SR method significantly improves performance across multiple benchmarks and enhances sample efficiency. Additionally, the authors propose an optional distillation phase to eliminate the extra computational overhead during subsequent deployment stages.
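The idea of presenting historical experiences at each decision point can be illustrated with a toy sketch. This summary does not describe the paper's actual retrieval or summarization mechanism, so everything below is an assumption made for illustration: the `HistoryBuffer` class, the nearest-neighbor lookup, and the mean-based summary are hypothetical stand-ins, not the authors' implementation.

```python
import math
from collections import deque


class HistoryBuffer:
    """Hypothetical store of previously visited states."""

    def __init__(self, capacity):
        self.states = deque(maxlen=capacity)  # oldest states are evicted first

    def add(self, state):
        self.states.append(tuple(state))

    def nearest(self, query, k):
        # Assumption: retrieve the k stored states closest in Euclidean distance.
        return sorted(self.states, key=lambda s: math.dist(s, query))[:k]


def summarize(states, dim):
    """Per-dimension mean of the retrieved states (zeros if none exist)."""
    if not states:
        return [0.0] * dim
    return [sum(s[i] for s in states) / len(states) for i in range(dim)]


def augment_observation(obs, buffer, k=3):
    """Concatenate the current observation with a summary of similar past states."""
    neighbors = buffer.nearest(obs, k)
    return list(obs) + summarize(neighbors, len(obs))


buffer = HistoryBuffer(capacity=100)
for past_state in [(0.0, 0.0), (1.0, 1.0), (10.0, 10.0)]:
    buffer.add(past_state)

# The policy would receive this augmented vector instead of the raw observation.
print(augment_observation((0.5, 0.5), buffer, k=2))
```

In this sketch the policy input carries information about where the agent has already been, which is the kind of signal that could let it account for a reward that depends on visitation history. A learned retrieval module could replace the fixed distance metric and mean summary used here.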

Unsupervised Discovery of Transitional Skills for Deep Reinforcement Learning

Self-Supervised Exploration via Temporal Inconsistency in Reinforcement Learning

Learning How to Self-Learn: Enhancing Self-Training Using Neural Reinforcement Learning

Intrinsically Motivated Self-supervised Learning in Reinforcement Learning

Re-ReST: Reflection-Reinforced Self-Training for Language Agents

Reinforcement Learning with Unsupervised Auxiliary Tasks

Leveraging Skills from Unlabeled Prior Data for Efficient Online Exploration

Integrating human learning and reinforcement learning: A novel approach to agent training

Generalizing Reinforcement Learning through Fusing Self-Supervised Learning into Intrinsic Motivation

Just Say What You Want: Only-prompting Self-rewarding Online Preference Optimization

Semi-Supervised Reward Modeling via Iterative Self-Training

Self-Improving Robots: End-to-End Autonomous Visuomotor Reinforcement Learning

Self-Supervised Discovering of Interpretable Features for Reinforcement Learning

SuperHF: Supervised Iterative Learning from Human Feedback

Skill-Based Reinforcement Learning with Intrinsic Reward Matching

Enhanced Generalization through Prioritization and Diversity in Self-Imitation Reinforcement Learning over Procedural Environments with Sparse Rewards

Self-Improvement in Language Models: The Sharpening Mechanism

Teach and Explore: A Multiplex Information-guided Effective and Efficient Reinforcement Learning for Sequential Recommendation

Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models

Intuitive Fine-Tuning: Towards Unifying SFT and RLHF into a Single Process