Planning to Go Out-of-Distribution in Offline-to-Online Reinforcement Learning

Trevor McInroe, Adam Jelley, Stefano V. Albrecht, Amos Storkey
2024-06-21
Abstract: Offline pretraining with a static dataset followed by online fine-tuning (offline-to-online, or OtO) is a paradigm well matched to a real-world RL deployment process. In this scenario, we aim to find the best-performing policy within a limited budget of online interactions. Previous work in the OtO setting has focused on correcting for bias introduced by the policy-constraint mechanisms of offline RL algorithms. Such constraints keep the learned policy close to the behavior policy that collected the dataset, but we show this can unnecessarily limit policy performance if the behavior policy is far from optimal. Instead, we forgo constraints and frame OtO RL as an exploration problem that aims to maximize the benefit of online data-collection. We first study the major online RL exploration methods based on intrinsic rewards and UCB in the OtO setting, showing that intrinsic rewards add training instability through reward-function modification, and UCB methods are myopic and it is unclear which learned-component's ensemble to use for action selection. We then introduce an algorithm for planning to go out-of-distribution (PTGOOD) that avoids these issues. PTGOOD uses a non-myopic planning procedure that targets exploration in relatively high-reward regions of the state-action space unlikely to be visited by the behavior policy. By leveraging concepts from the Conditional Entropy Bottleneck, PTGOOD encourages data collected online to provide new information relevant to improving the final deployment policy without altering rewards. We show empirically in several continuous control tasks that PTGOOD significantly improves agent returns during online fine-tuning and avoids the suboptimal policy convergence that many of our baselines exhibit in several environments.
Machine Learning
### What problem does this paper attempt to solve?

This paper addresses the exploration problem in offline-to-online reinforcement learning (OtO RL). Specifically, the researchers ask how an effective exploration strategy can improve the final policy within a limited budget of online interactions. Previous OtO RL methods mainly focus on correcting the bias introduced by the policy-constraint mechanisms of offline RL algorithms. This paper instead forgoes constraints and proposes a new method, **Planning to Go Out-of-Distribution (PTGOOD)**, to overcome the limitations of existing methods.

#### Main problem description

1. **Combination of offline pretraining and online fine-tuning**:
   - A policy is pretrained offline on a static dataset and then fine-tuned through a limited number of online interactions.
   - The goal is to find the best-performing policy within that limited online interaction budget.
2. **Limitations of existing methods**:
   - Existing OtO RL methods usually rely on policy-constraint mechanisms that keep the learned policy close to the behavior policy that collected the dataset. This can unnecessarily limit policy performance when the behavior policy is far from optimal.
   - Many of the baselines evaluated in the paper converge to suboptimal policies during online fine-tuning in several environments.
3. **Importance of the exploration problem**:
   - During online fine-tuning, the agent must choose carefully which state-action pairs to collect, because the number of environment interactions is limited.
   - The major online exploration methods, those based on intrinsic rewards and on UCB, have deficiencies in the OtO setting. Intrinsic-reward methods add training instability because they modify the reward function. UCB methods are myopic, and it is unclear which learned component's ensemble should be used for action selection.
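To make the myopia concern concrete, the following is a minimal sketch of generic UCB-style action selection over an ensemble of Q-value estimates. It is not the paper's baseline implementation: the function name `ucb_select`, the bonus weight `beta`, and the toy Q-values are all assumptions chosen for illustration. Note that the score depends only on the immediate candidate actions, which is the sense in which such a rule looks one step ahead.

```python
import numpy as np

def ucb_select(q_values: np.ndarray, beta: float) -> int:
    """Pick the candidate action maximizing ensemble mean + beta * ensemble std.

    q_values has shape (n_ensemble_members, n_candidate_actions).
    This is a generic UCB rule, assumed here for illustration only.
    """
    mean = q_values.mean(axis=0)   # average value estimate per candidate
    std = q_values.std(axis=0)     # ensemble disagreement per candidate
    return int(np.argmax(mean + beta * std))

# Hypothetical toy ensemble: 3 members scoring 3 candidate actions.
# The members agree on actions 0 and 2 but disagree on action 1.
q_values = np.array([
    [1.0, 0.2, 0.5],
    [1.0, 1.8, 0.5],
    [1.0, 0.4, 0.5],
])

greedy_choice = ucb_select(q_values, beta=0.0)      # no exploration bonus
optimistic_choice = ucb_select(q_values, beta=2.0)  # bonus favors disagreement
```

With no bonus the rule picks the action with the highest mean estimate; with a large enough bonus it switches to the action the ensemble disagrees on most. The sketch also shows the second concern: the result depends entirely on which ensemble (here, Q-functions) supplies the disagreement signal.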
#### Proposed solution

To overcome these problems, the paper proposes the PTGOOD algorithm, which improves exploration in three ways:

1. **Non-myopic planning**:
   - PTGOOD uses a multi-step planning procedure that aims to maximize how far out-of-distribution the collected transition tuples are relative to the offline dataset, which avoids collecting redundant data.
2. **Exploration of high-reward regions**:
   - PTGOOD keeps its exploration guidance from deviating too far from the policy being fine-tuned. It does so by sampling actions from that policy and adding a small amount of noise, which naturally targets higher-reward regions.
3. **Use of the Conditional Entropy Bottleneck (CEB)**:
   - PTGOOD draws on concepts from the CEB to estimate the density of state-action pairs in the offline dataset and to identify relatively high-reward regions that the behavior policy is unlikely to visit. This encourages data collected online to provide new information relevant to improving the final deployment policy, without altering rewards.

With these components, PTGOOD significantly improves agent returns during online fine-tuning in several continuous control tasks and avoids the suboptimal policy convergence that many baselines exhibit in several environments.
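The planning idea summarized above can be sketched roughly as follows. This is a hedged illustration, not the authors' implementation: `plan_ood_action`, its parameters, the toy `policy` and `dynamics`, and the squared-norm `ood_score` (a stand-in for a CEB-based density estimate) are all hypothetical. The sketch rolls out several noisy copies of the policy in a model, sums an out-of-distribution score over each multi-step rollout, and returns the first action of the best rollout.

```python
import numpy as np

def plan_ood_action(state, policy, dynamics, ood_score, rng,
                    n_candidates=64, horizon=5, noise_std=0.1):
    """Return the first action of the noisy-policy rollout with the highest
    summed out-of-distribution score (a sketch under stated assumptions)."""
    best_score, best_action = -np.inf, None
    for _ in range(n_candidates):
        s, total, first_action = state, 0.0, None
        for _ in range(horizon):
            mean_action = policy(s)
            # Stay near the fine-tuned policy: sample it, then add small noise.
            a = mean_action + rng.normal(0.0, noise_std, size=mean_action.shape)
            if first_action is None:
                first_action = a
            total += ood_score(s, a)   # accumulate novelty over the whole rollout
            s = dynamics(s, a)         # step the (assumed) learned model
        if total > best_score:
            best_score, best_action = total, first_action
    return best_action

# Hypothetical toy components, for illustration only.
def policy(s):
    return -0.5 * s

def dynamics(s, a):
    return s + a

def ood_score(s, a):
    # Stand-in: pretend the offline data sits near the origin, so larger
    # norms are "less likely" under the dataset.
    return float(np.sum(s ** 2) + np.sum(a ** 2))

rng = np.random.default_rng(0)
state = np.array([0.5, -0.2])
action = plan_ood_action(state, policy, dynamics, ood_score, rng)
```

Because the score is summed over the horizon rather than taken from the first step alone, a candidate can win by leading to novel regions several steps later, which is what distinguishes this from a one-step rule. Rewards are never modified; the score only influences which action is executed during data collection.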