OER: Offline Experience Replay for Continual Offline Reinforcement Learning

Sibo Gai, Donglin Wang, Li He
DOI: https://doi.org/10.3233/FAIA230343
2024-04-20
Abstract: The capability of continuously learning new skills via a sequence of pre-collected offline datasets is desired for an agent. However, consecutively learning a sequence of offline tasks likely leads to the catastrophic forgetting issue under resource-limited scenarios. In this paper, we formulate a new setting, continual offline reinforcement learning (CORL), where an agent learns a sequence of offline reinforcement learning tasks and pursues good performance on all learned tasks with a small replay buffer without exploring any of the environments of the sequential tasks. To learn consistently on all sequential tasks, an agent needs to acquire new knowledge while preserving old knowledge in an offline manner. To this end, we introduce continual learning algorithms and experimentally find experience replay (ER) to be the most suitable algorithm for the CORL problem. However, we observe that introducing ER into CORL encounters a new distribution shift problem: the mismatch between the experiences in the replay buffer and trajectories from the learned policy. To address this issue, we propose a new model-based experience selection (MBES) scheme to build the replay buffer, where a transition model is learned to approximate the state distribution. This model is used to bridge the distribution bias between the replay buffer and the learned model by filtering the data from offline data that most closely resembles the learned model for storage. Moreover, in order to enhance the ability to learn new tasks, we retrofit the experience replay method with a new dual behavior cloning (DBC) architecture to avoid the disturbance of behavior-cloning loss on the Q-learning process. In general, we call our algorithm offline experience replay (OER). Extensive experiments demonstrate that our OER method outperforms SOTA baselines in widely-used MuJoCo environments.
Machine Learning
What problem does this paper attempt to address?
This paper formulates a new setting for offline reinforcement learning, continual offline reinforcement learning (CORL), and targets the catastrophic forgetting that arises when an agent learns new skills from a sequence of pre-collected offline datasets under limited resources. In conventional offline reinforcement learning, learning a task from a static dataset suffers from value overestimation caused by distribution shift. In the continual setting, the agent must additionally avoid performance degradation on previous tasks while learning new ones. The paper observes that directly applying experience replay (ER) to CORL involves two distribution shifts: the shift between the behavior policy and the learned policy, and the shift between the experiences in the replay buffer and the trajectories of the learned policy. To address these problems, the paper proposes two components:
1. Model-based experience selection (MBES): a learned dynamics (transition) model approximates the state distribution of the learned policy, and the offline data that most closely resembles it is selected and stored in the replay buffer, which reduces the distribution shift.
2. Dual behavior cloning (DBC) architecture: to keep the behavior-cloning loss from interfering with the Q-learning process, one policy network is optimized for the current task, while a second policy network is optimized for both the new and the old tasks from a continual learning perspective.
Combining MBES and DBC yields the offline experience replay (OER) algorithm. Experiments show that OER outperforms state-of-the-art baselines in widely used MuJoCo environments, which supports its effectiveness on sequences of continuous control tasks.
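The selection idea behind MBES can be illustrated with a small sketch. This is not the authors' implementation: it assumes a greedy nearest-neighbour rule under Euclidean distance, and the names `select_experiences`, `policy`, `dynamics`, and the toy data are hypothetical stand-ins for the learned policy, the learned transition model, and an offline dataset.

```python
import numpy as np


def select_experiences(states, policy, dynamics, start_state, buffer_size):
    """Return indices of dataset states that track a model rollout of the policy.

    Assumption: selection is greedy nearest-neighbour in state space. At each
    step the dataset state closest to the model-predicted state is stored, and
    the rollout continues from that stored (real) state.
    """
    states = np.asarray(states, dtype=float)
    available = np.ones(len(states), dtype=bool)
    target = np.asarray(start_state, dtype=float)
    chosen = []
    for _ in range(min(buffer_size, len(states))):
        # Distance from every dataset state to the predicted state.
        dists = np.linalg.norm(states - target, axis=1)
        dists[~available] = np.inf  # never pick the same transition twice
        idx = int(np.argmin(dists))
        chosen.append(idx)
        available[idx] = False
        # Predict where the learned policy would go next from the chosen state.
        target = dynamics(states[idx], policy(states[idx]))
    return np.array(chosen)


# Toy example (hypothetical): 1-D states 0..9, a policy that always moves by +2,
# and an additive dynamics model s' = s + a.
toy_states = np.arange(10.0).reshape(-1, 1)
toy_policy = lambda s: np.array([2.0])
toy_dynamics = lambda s, a: s + a

chosen = select_experiences(toy_states, toy_policy, toy_dynamics,
                            start_state=[0.0], buffer_size=3)
print(chosen.tolist())  # [0, 2, 4]
```

In this toy case the buffer keeps the states the policy would actually visit (0, 2, 4) rather than an arbitrary subset of the dataset, which is the kind of mismatch between replay buffer and learned-policy trajectories that the summary describes.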