Adaptive Policy Learning for Offline-to-Online Reinforcement Learning

Han Zheng, Xufang Luo, Pengfei Wei, Xuan Song, Dongsheng Li, Jing Jiang
DOI: https://doi.org/10.48550/arXiv.2303.07693
2023-03-14
Abstract: Conventional reinforcement learning (RL) needs an environment to collect fresh data, which is impractical when online interactions are costly. Offline RL provides an alternative solution by directly learning from the previously collected dataset. However, it will yield unsatisfactory performance if the quality of the offline datasets is poor. In this paper, we consider an offline-to-online setting where the agent is first learned from the offline dataset and then trained online, and propose a framework called Adaptive Policy Learning for effectively taking advantage of offline and online data. Specifically, we explicitly consider the difference between the online and offline data and apply an adaptive update scheme accordingly, that is, a pessimistic update strategy for the offline dataset and an optimistic/greedy update scheme for the online dataset. Such a simple and effective method provides a way to mix the offline and online RL and achieve the best of both worlds. We further provide two detailed algorithms for implementing the framework through embedding value or policy-based RL algorithms into it. Finally, we conduct extensive experiments on popular continuous control tasks, and results show that our algorithm can learn the expert policy with high sample efficiency even when the quality of offline dataset is poor, e.g., random dataset.
Machine Learning, Artificial Intelligence
What problem does this paper attempt to address?
The paper addresses how to combine the advantages of offline and online data in reinforcement learning to improve the performance of the learned policy. It focuses on the offline-to-online setting, in which an agent is first trained on an offline dataset and then trained further through online interaction. Existing methods face three challenges in this setting:

1. **Poor offline data quality**: When the offline dataset is of low quality, methods that rely solely on it usually perform poorly.
2. **Underused online data**: Existing methods may not take full advantage of online data, which lowers learning efficiency.
3. **Distribution mismatch**: Differences between the offline and online data distributions can degrade performance during learning.

To address these challenges, the paper proposes a framework called Adaptive Policy Learning (APL). Its core ideas are:

- **Optimistic update strategy**: When learning from online data, APL uses an optimistic update, because this data reflects the behavior of the current policy.
- **Pessimistic update strategy**: When learning from offline data, APL uses a pessimistic update to guard against overfitting and distribution mismatch.
- **Two-layer replay buffer**: A two-layer replay buffer (Online-Offline Replay Buffer, OORB) separates near-on-policy data from offline data.

With these components, APL can switch update strategies according to the data source.
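The two-layer replay buffer and the source-dependent choice of update can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the class name `OnlineOfflineReplayBuffer`, the `p_online` mixing probability, the FIFO capacity, and the helper `choose_update` are all assumptions introduced here.

```python
import random
from collections import deque


class OnlineOfflineReplayBuffer:
    """Toy two-layer buffer: a fixed offline dataset plus a bounded online FIFO."""

    def __init__(self, offline_data, online_capacity=1000, p_online=0.5, seed=0):
        self.offline = list(offline_data)             # fixed, never overwritten
        self.online = deque(maxlen=online_capacity)   # recent, near-on-policy data
        self.p_online = p_online                      # assumed mixing probability
        self.rng = random.Random(seed)

    def add_online(self, transition):
        # The oldest online transition is dropped once capacity is reached.
        self.online.append(transition)

    def sample(self, batch_size):
        """Return (batch, source), where source is 'online' or 'offline'."""
        use_online = len(self.online) > 0 and self.rng.random() < self.p_online
        if use_online:
            pool, source = self.online, "online"
        else:
            pool, source = self.offline, "offline"
        batch = [pool[self.rng.randrange(len(pool))] for _ in range(batch_size)]
        return batch, source


def choose_update(source):
    """Map the data source to the update style: optimistic online, pessimistic offline."""
    return "optimistic" if source == "online" else "pessimistic"
```

A training loop would call `sample`, pass the returned `source` to `choose_update`, and apply the matching update rule to that batch. Keeping the source tag with each batch is what lets the learner treat the two kinds of data differently.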
The paper also provides two implementations of the framework, one built on value-based and one on policy-based RL algorithms, and evaluates them experimentally. The results show that APL performs well on a range of continuous control tasks and can still learn expert policies efficiently when the offline data is of poor quality.
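As a rough illustration of how a value-based variant might switch between the two update styles, the sketch below adds a conservative penalty to the critic loss only for offline batches. The summary does not say how pessimism is implemented; the CQL-style log-sum-exp penalty, the function names, and the `alpha` weight are assumptions chosen here for illustration.

```python
import math


def td_loss(q_pred, q_target):
    """Mean squared TD error over a batch."""
    return sum((p - t) ** 2 for p, t in zip(q_pred, q_target)) / len(q_pred)


def conservative_penalty(q_candidates, q_data):
    """Mean of log-sum-exp of Q over candidate actions minus Q on the dataset action."""
    gaps = []
    for cands, q_d in zip(q_candidates, q_data):
        m = max(cands)  # subtract the max for numerical stability
        lse = m + math.log(sum(math.exp(q - m) for q in cands))
        gaps.append(lse - q_d)
    return sum(gaps) / len(gaps)


def adaptive_critic_loss(q_pred, q_target, q_candidates, source, alpha=1.0):
    """Plain TD loss for online batches; TD loss plus a penalty for offline batches."""
    loss = td_loss(q_pred, q_target)
    if source == "offline":
        # Pessimistic branch (assumed form): push down Q on actions outside the data.
        loss += alpha * conservative_penalty(q_candidates, q_pred)
    return loss
```

Here `q_candidates` holds, for each state in the batch, Q-values of a few sampled actions. For an online batch the penalty is skipped, so the update follows the ordinary greedy TD target.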