Adversarial Model for Offline Reinforcement Learning

Mohak Bhardwaj, Tengyang Xie, Byron Boots, Nan Jiang, Ching-An Cheng
2023-12-24
Abstract: We propose a novel model-based offline Reinforcement Learning (RL) framework, called Adversarial Model for Offline Reinforcement Learning (ARMOR), which can robustly learn policies to improve upon an arbitrary reference policy regardless of data coverage. ARMOR is designed to optimize policies for the worst-case performance relative to the reference policy through adversarially training a Markov decision process model. In theory, we prove that ARMOR, with a well-tuned hyperparameter, can compete with the best policy within data coverage when the reference policy is supported by the data. At the same time, ARMOR is robust to hyperparameter choices: the policy learned by ARMOR, with "any" admissible hyperparameter, would never degrade the performance of the reference policy, even when the reference policy is not covered by the dataset. To validate these properties in practice, we design a scalable implementation of ARMOR, which by adversarial training, can optimize policies without using model ensembles in contrast to typical model-based methods. We show that ARMOR achieves competent performance with both state-of-the-art offline model-free and model-based RL algorithms and can robustly improve the reference policy over various hyperparameter choices.
Machine Learning, Artificial Intelligence
What problem does this paper attempt to address?
The paper addresses two key issues in offline reinforcement learning (RL):

1. **Performance degradation**: In practice, RL is often used to improve an existing policy (such as autonomous driving rules or a heuristic-based diagnostic system), so the newly learned policy should not perform worse than the original one. Pessimism-based offline RL algorithms cannot guarantee this, especially when the reference policy differs from the behavior policy that collected the data.
2. **Data coverage**: A core challenge of offline RL is insufficient data coverage: the training data may not cover all important scenarios, which limits the quality of the best policy the algorithm can learn.

To address these issues, the paper proposes a model-based offline RL framework, **Adversarial Model for Offline Reinforcement Learning (ARMOR)**, which rests on two ideas:

- **Relative pessimism**: ARMOR optimizes the policy for its worst-case performance relative to the reference policy, rather than for its worst-case absolute performance. This ensures that the learned policy is not worse than the reference policy and can compete with the best policy within the data coverage.
- **Adversarial model training**: ARMOR realizes relative pessimism by adversarially training a Markov decision process (MDP) model. During training, it simultaneously optimizes a policy and an MDP model; the model stays consistent with the data while being trained to make the reference policy outperform the learned policy as much as possible, which forces the learned policy to perform well even in regions of high uncertainty.

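
The max-min structure of relative pessimism can be illustrated on a toy problem. The sketch below is not ARMOR's implementation: it replaces the paper's adversarial training with exhaustive enumeration over a hypothetical 3-armed bandit (a one-step MDP), and every name in it (`CANDIDATE_MODELS`, `relative_worst_case`, `relative_pessimism`) is an assumption made for illustration. The candidate models agree on the two arms seen in the data and disagree on the arm that was never observed.

```python
# Hypothetical toy setup: a 3-armed bandit, i.e. a one-step MDP.
# Each candidate model is a reward vector assumed to fit the offline data
# equally well. Arm 2 never appears in the data, so the candidates disagree on it.
CANDIDATE_MODELS = [
    (0.5, 0.7, 0.0),  # arm 2 might be terrible
    (0.5, 0.7, 1.0),  # arm 2 might be excellent
]
POLICIES = [0, 1, 2]  # deterministic policies: "always pull arm a"


def policy_value(model, policy):
    """Return J_M(pi): the expected reward of a policy under one model."""
    return model[policy]


def relative_worst_case(policy, reference, models):
    """Return min over models of J_M(policy) - J_M(reference)."""
    return min(
        policy_value(m, policy) - policy_value(m, reference) for m in models
    )


def relative_pessimism(policies, reference, models):
    """Return the policy with the best worst-case gap to the reference."""
    return max(
        policies, key=lambda pi: relative_worst_case(pi, reference, models)
    )


# Reference covered by the data (arm 0): the learner can safely improve on it.
covered_choice = relative_pessimism(POLICIES, 0, CANDIDATE_MODELS)
# Reference not covered by the data (arm 2): the learner keeps the reference.
uncovered_choice = relative_pessimism(POLICIES, 2, CANDIDATE_MODELS)

print(covered_choice, uncovered_choice)
```

With arm 0 as the reference, arm 1 is selected because it beats the reference under every candidate model. With the unobserved arm 2 as the reference, any deviation is worse under at least one candidate model, so the reference itself is returned and its performance is not degraded.
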
The paper also provides theoretical analysis proving that ARMOR satisfies the "Robust Policy Improvement (RPI)" property: for any admissible hyperparameter, the policy learned by ARMOR is not worse than the given reference policy, even when the reference policy is not covered by the dataset. In addition, with a well-tuned hyperparameter, ARMOR can compete with the best policy within the data coverage when the reference policy is supported by the data. The experimental section evaluates ARMOR on multiple benchmark datasets, including continuous control tasks, and shows that it matches or exceeds existing offline RL algorithms in many cases. Notably, ARMOR achieves these results with a single model rather than an ensemble of models, which makes it easier to pair with high-capacity world models.
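
The reasoning behind the RPI property can be sketched in a few lines. The notation here is assumed for illustration and is not defined in the summary above: $\Pi$ is the policy class, $\mathcal{M}_\alpha$ is the set of models consistent with the data up to a slack set by the hyperparameter $\alpha$, $J_M(\pi)$ is the return of $\pi$ under model $M$, and $M^\star$ is the true MDP. The argument assumes $\pi_{\mathrm{ref}} \in \Pi$ and $M^\star \in \mathcal{M}_\alpha$.

```latex
\hat{\pi} \in \arg\max_{\pi \in \Pi} \; \min_{M \in \mathcal{M}_\alpha}
  \bigl[ J_M(\pi) - J_M(\pi_{\mathrm{ref}}) \bigr]

\begin{aligned}
J_{M^\star}(\hat{\pi}) - J_{M^\star}(\pi_{\mathrm{ref}})
  &\ge \min_{M \in \mathcal{M}_\alpha}
       \bigl[ J_M(\hat{\pi}) - J_M(\pi_{\mathrm{ref}}) \bigr]
       && \text{since } M^\star \in \mathcal{M}_\alpha \\
  &= \max_{\pi \in \Pi} \min_{M \in \mathcal{M}_\alpha}
       \bigl[ J_M(\pi) - J_M(\pi_{\mathrm{ref}}) \bigr]
       && \text{by definition of } \hat{\pi} \\
  &\ge \min_{M \in \mathcal{M}_\alpha}
       \bigl[ J_M(\pi_{\mathrm{ref}}) - J_M(\pi_{\mathrm{ref}}) \bigr]
       && \text{since } \pi_{\mathrm{ref}} \in \Pi \\
  &= 0 .
\end{aligned}
```

The chain never uses how well the data covers $\pi_{\mathrm{ref}}$, which is consistent with the claim that the guarantee holds regardless of data coverage.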