Abstract: We propose a novel model-based offline Reinforcement Learning (RL) framework, called Adversarial Model for Offline Reinforcement Learning (ARMOR), which can robustly learn policies to improve upon an arbitrary reference policy regardless of data coverage. ARMOR is designed to optimize policies for the worst-case performance relative to the reference policy through adversarially training a Markov decision process model. In theory, we prove that ARMOR, with a well-tuned hyperparameter, can compete with the best policy within data coverage when the reference policy is supported by the data. At the same time, ARMOR is robust to hyperparameter choices: the policy learned by ARMOR, with "any" admissible hyperparameter, would never degrade the performance of the reference policy, even when the reference policy is not covered by the dataset. To validate these properties in practice, we design a scalable implementation of ARMOR, which, by adversarial training, can optimize policies without using model ensembles, in contrast to typical model-based methods. We show that ARMOR achieves performance competitive with both state-of-the-art offline model-free and model-based RL algorithms and can robustly improve the reference policy over various hyperparameter choices.

What problem does this paper attempt to address?

The paper aims to address two key issues in Offline Reinforcement Learning (ORL):

1. **Performance Degradation Issue**: In practical applications, reinforcement learning algorithms are often used to improve existing policies (such as autonomous driving rules or heuristic-based diagnostic systems). The newly learned policy should therefore perform at least as well as the original policy. However, pessimism-based offline reinforcement learning algorithms cannot guarantee this, especially when the reference policy differs from the behavior policy that collected the data.
2. **Data Coverage Issue**: A core challenge of offline reinforcement learning is insufficient data coverage, meaning that the training data may not cover all important scenarios. This limits the quality of the best policy that the algorithm can learn.

To address these issues, the paper proposes a new model-based offline reinforcement learning framework, **Adversarial Model for Offline Reinforcement Learning (ARMOR)**. ARMOR achieves its goals through the following methods:

- **Relative Pessimism**: ARMOR adopts a concept called "relative pessimism," which optimizes the policy for its worst-case performance relative to the reference policy. This design ensures that the policy learned by ARMOR is not worse than the reference policy and can compete with the best policy within the data coverage.
- **Adversarial Model Training**: ARMOR achieves the above goal by adversarially training a Markov Decision Process (MDP) model. Specifically, during training, ARMOR simultaneously optimizes a policy and an MDP model, where the MDP model is trained to make the reference policy outperform the learned policy as much as possible, thereby forcing the learned policy to perform well even in areas of high uncertainty.
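The difference between relative pessimism and ordinary (absolute) pessimism can be illustrated with a toy example. The sketch below is not ARMOR's implementation: it assumes a hypothetical single-state problem with three actions, a small hand-written set of candidate reward models (`MODELS`) standing in for the models that remain plausible given the data, and a reference policy (`REF`) that plays an action the data does not cover. All names and numbers are made up for illustration.

```python
from itertools import product

# Hypothetical single-state problem with three actions (all numbers are made up).
# Each candidate model is a tuple of rewards (r[0], r[1], r[2]) that we assume
# is consistent with the offline data. Action 1 is well covered (reward known),
# while actions 0 and 2 are not covered, so their rewards are uncertain.
MODELS = list(product([0.2, 0.8], [0.5], [0.0, 1.0]))
ACTIONS = [0, 1, 2]
REF = 0  # the reference policy always plays action 0, which the data does not cover


def absolute_pessimism(models, actions):
    """Pick the action with the best worst-case reward."""
    return max(actions, key=lambda a: min(m[a] for m in models))


def relative_pessimism(models, actions, ref):
    """Pick the action with the best worst-case reward gap over the reference."""
    return max(actions, key=lambda a: min(m[a] - m[ref] for m in models))


abs_choice = absolute_pessimism(MODELS, ACTIONS)
rel_choice = relative_pessimism(MODELS, ACTIONS, REF)

# Worst-case improvement over the reference, taken over all candidate models.
worst_gap_abs = min(m[abs_choice] - m[REF] for m in MODELS)
worst_gap_rel = min(m[rel_choice] - m[REF] for m in MODELS)

print("absolute pessimism picks", abs_choice, "worst-case gap vs. reference:", worst_gap_abs)
print("relative pessimism picks", rel_choice, "worst-case gap vs. reference:", worst_gap_rel)
```

In this toy setting, absolute pessimism prefers the well-covered action even though some candidate models make it worse than the reference, whereas relative pessimism falls back to the reference action because no alternative is guaranteed to beat it under every candidate model.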
The paper also provides theoretical analysis, proving that ARMOR satisfies the "Robust Policy Improvement (RPI)" property: for any admissible hyperparameter, the policy learned by ARMOR will not be worse than the given reference policy. Additionally, with a well-tuned hyperparameter, ARMOR can compete with the best policy within the data coverage when the reference policy is supported by the data. The experimental section evaluates ARMOR on multiple benchmark datasets, including continuous control tasks, showing that ARMOR can match or exceed the performance of existing offline reinforcement learning algorithms in many cases. Notably, ARMOR achieves these results using only a single model rather than an ensemble of models, which is an advantage when using high-capacity world models.
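The reasoning behind the RPI property can be sketched in a few lines. The notation here is assumed for illustration rather than taken from the text above: $\Pi$ is the policy class, $\mathcal{M}_\alpha$ is the set of MDP models considered plausible under hyperparameter $\alpha$, $J_M(\pi)$ is the return of policy $\pi$ in model $M$, and $M^\star$ is the true MDP. The sketch further assumes that $\pi_{\mathrm{ref}} \in \Pi$ and $M^\star \in \mathcal{M}_\alpha$.

```latex
% Relative pessimism objective (notation assumed for illustration)
\hat{\pi} \in \arg\max_{\pi \in \Pi} \; \min_{M \in \mathcal{M}_\alpha}
  \bigl[ J_M(\pi) - J_M(\pi_{\mathrm{ref}}) \bigr]

% Step 1: because \pi_{\mathrm{ref}} \in \Pi, the optimal value is at least
% the value obtained by choosing \pi = \pi_{\mathrm{ref}}, which is zero.
\min_{M \in \mathcal{M}_\alpha}
  \bigl[ J_M(\hat{\pi}) - J_M(\pi_{\mathrm{ref}}) \bigr]
  \;\ge\;
\min_{M \in \mathcal{M}_\alpha}
  \bigl[ J_M(\pi_{\mathrm{ref}}) - J_M(\pi_{\mathrm{ref}}) \bigr] = 0

% Step 2: because M^\star \in \mathcal{M}_\alpha, the true gap is bounded
% below by the worst-case gap.
J_{M^\star}(\hat{\pi}) - J_{M^\star}(\pi_{\mathrm{ref}})
  \;\ge\;
\min_{M \in \mathcal{M}_\alpha}
  \bigl[ J_M(\hat{\pi}) - J_M(\pi_{\mathrm{ref}}) \bigr]
  \;\ge\; 0
```

Under these assumptions the argument does not depend on whether the data covers $\pi_{\mathrm{ref}}$, which is consistent with the claim that the learned policy does not degrade the reference policy regardless of data coverage.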

Adversarial Model for Offline Reinforcement Learning

Towards Robust Policy: Enhancing Offline Reinforcement Learning with Adversarial Attacks and Defenses

Online Robust Policy Learning in the Presence of Unknown Adversaries

MOReL : Model-Based Offline Reinforcement Learning

Adaptive Policy Learning for Offline-to-Online Reinforcement Learning

Robust Reinforcement Learning using Offline Data

Robust Offline Reinforcement Learning from Low-Quality Data

Robust Offline Reinforcement Learning for Non-Markovian Decision Processes

Model-Based Offline Planning

Risk-Averse Offline Reinforcement Learning

A Non-Monolithic Policy Approach of Offline-to-Online Reinforcement Learning

Marvel: Accelerating Safe Online Reinforcement Learning with Finetuned Offline Policy

Offline Reinforcement Learning with Reverse Model-based Imagination

Policy Agnostic RL: Offline RL and Online RL Fine-Tuning of Any Class and Backbone

Beyond Worst-case Attacks: Robust RL with Adaptive Defense via Non-dominated Policies

Robust Multi-Agent Reinforcement Learning via Adversarial Regularization: Theoretical Foundation and Stable Algorithms

Bayes Adaptive Monte Carlo Tree Search for Offline Model-based Reinforcement Learning

Deploying Offline Reinforcement Learning with Human Feedback

Self-Play with Adversarial Critic: Provable and Scalable Offline Alignment for Language Models

Self-Confirming Transformer for Locally Consistent Online Adaptation in Multi-Agent Reinforcement Learning

MoMA: Model-based Mirror Ascent for Offline Reinforcement Learning