Abstract: This paper endeavors to augment the robustness of offline reinforcement learning (RL) in scenarios laden with heavy-tailed rewards, a prevalent circumstance in real-world applications. We propose two algorithmic frameworks, ROAM and ROOM, for robust off-policy evaluation and offline policy optimization (OPO), respectively. Central to our frameworks is the strategic incorporation of the median-of-means method with offline RL, enabling straightforward uncertainty estimation for the value function estimator. This not only adheres to the principle of pessimism in OPO but also adeptly manages heavy-tailed rewards. Theoretical results and extensive experiments demonstrate that our two frameworks outperform existing methods when the logged dataset exhibits heavy-tailed reward distributions. The implementation of the proposal is available at https://github.com/Mamba413/ROOM.

What problem does this paper attempt to address?

This paper asks how to make offline reinforcement learning (offline RL) algorithms robust when the reward distribution is heavy-tailed. Heavy-tailed rewards are common in real-world applications and pose a serious challenge to existing offline RL methods. Specifically:

1. **Off-Policy Evaluation (OPE)**: The goal of OPE is to evaluate the value of a new policy using only historical data. Standard regression methods are very sensitive to heavy-tailed rewards, which slows convergence and degrades the accuracy of policy evaluation.
2. **Offline Policy Optimization (OPO)**: Heavy-tailed rewards exacerbate the overestimation problem in standard RL algorithms. In a multi-armed bandit, for example, the estimate of each arm's expected reward has large variance, which increases the probability of selecting a suboptimal arm and so harms policy optimization.

To address these problems, the authors propose two frameworks: ROAM for off-policy evaluation and ROOM for offline policy optimization. The core idea of both is to bring the median-of-means (MM) method from robust statistics into offline RL, which handles heavy-tailed rewards and provides an intuitive way to quantify uncertainty. Theoretical analysis and extensive experiments indicate that these frameworks outperform existing methods when rewards are heavy-tailed.

### Formula Summary

- **Median-of-Means Estimator**:
  \[ \text{MM}=\text{Median}\left(\left\{\frac{1}{|B_k|} \sum_{i \in B_k} R_i\right\}_{k = 1}^K\right) \]
  where \(R_i\) is the observed heavy-tailed reward, and \(B_k\) is the \(k\)-th of the \(K\) subsets into which the data are divided.
- **Value Estimation in the ROAM Algorithm**:
  \[ \hat{J}_\pi=\mathbb{E}_{s\sim G, a\sim\pi(\cdot|s)}\left[\text{Median}\left(\left\{\hat{Q}_\pi^k(s, a)\right\}_{k = 1}^K\right)\right] \]
  where \(\hat{Q}_\pi^k(s, a)\) is the Q-function estimated from the \(k\)-th data subset.
- **Policy Optimization in the ROOM Algorithm**:
  \[ \hat{\pi}^*(s)=\arg\max_a\text{Median}\left(\left\{\hat{Q}^{*k}(s, a)\right\}_{k = 1}^K\right) \]
  where \(\hat{Q}^{*k}(s, a)\) is the optimal Q-function estimated from the \(k\)-th data subset.

By introducing the median-of-means method, these frameworks improve the robustness of offline RL under heavy-tailed rewards and also support uncertainty quantification, which matters in high-risk application scenarios.
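
The median-of-means estimator and the median aggregation of per-subset Q-function estimates summarized above can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names `median_of_means` and `greedy_action_from_median`, the optional shuffling step, and the toy arrays are assumptions made for illustration, and plain tabular arrays stand in for the fitted Q-functions.

```python
import numpy as np


def median_of_means(rewards, num_blocks, rng=None):
    """Median-of-means estimate of the mean reward.

    Splits the rewards into `num_blocks` subsets B_1..B_K, averages each
    subset, and returns the median of those K averages.
    """
    rewards = np.asarray(rewards, dtype=float)
    if not 1 <= num_blocks <= rewards.size:
        raise ValueError("num_blocks must be between 1 and len(rewards)")
    if rng is not None:
        # Optional shuffle (an assumption of this sketch) so that the
        # blocks do not depend on the logging order.
        rewards = rng.permutation(rewards)
    blocks = np.array_split(rewards, num_blocks)
    block_means = np.array([block.mean() for block in blocks])
    return float(np.median(block_means))


def greedy_action_from_median(q_estimates):
    """Greedy action per state from the median of K Q-function estimates.

    `q_estimates` has shape (K, n_states, n_actions); entry k is a
    hypothetical tabular Q-function fitted on the k-th data subset.
    """
    q_estimates = np.asarray(q_estimates, dtype=float)
    median_q = np.median(q_estimates, axis=0)  # shape (n_states, n_actions)
    return np.argmax(median_q, axis=1)


# Toy data: one extreme reward dominates the plain average.
toy_rewards = [1.0, 2.0, 3.0, 4.0, 5.0, 1000.0]
mm_estimate = median_of_means(toy_rewards, num_blocks=3)
plain_mean = float(np.mean(toy_rewards))

# Toy Q-tables: one state, two actions, K = 3 subsets. The third subset
# wildly overestimates action 1.
toy_q = [[[1.0, 0.0]], [[1.0, 0.0]], [[0.0, 100.0]]]
actions = greedy_action_from_median(toy_q)
```

In the toy data, the single outlier pulls the plain average far above the typical reward, while the median of the block means is unaffected by it. Likewise, averaging the three toy Q-tables would favor the overestimated action, whereas the median does not.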

Robust Offline Reinforcement learning with Heavy-Tailed Rewards

Beyond Reward: Offline Preference-guided Policy Optimization

Robust Reinforcement Learning using Offline Data

Towards Robust Offline-to-Online Reinforcement Learning via Uncertainty and Smoothness

Double Pessimism is Provably Efficient for Distributionally Robust Offline Reinforcement Learning: Generic Algorithm and Robust Partial Coverage

Online Policy Optimization for Robust MDP

Robust Offline Reinforcement Learning for Non-Markovian Decision Processes

Robust Offline Reinforcement Learning from Low-Quality Data

Optimistic Model Rollouts for Pessimistic Offline Policy Optimization

MOPO: Model-based Offline Policy Optimization

Pessimism Meets Risk: Risk-Sensitive Offline Reinforcement Learning

Adjustable Robust Reinforcement Learning for Online 3D Bin Packing

User-Oriented Robust Reinforcement Learning

Mind the Gap: Offline Policy Optimization for Imperfect Rewards.

Robust Offline Reinforcement Learning with Gradient Penalty and Constraint Relaxation

Robust Reinforcement Learning for Continuous Control with Model Misspecification

MICRO: Model-Based Offline Reinforcement Learning with a Conservative Bellman Operator

Robust Model-Based Reinforcement Learning with an Adversarial Auxiliary Model

On Practical Robust Reinforcement Learning: Adjacent Uncertainty Set and Double-Agent Algorithm.

Online Estimation and Inference for Robust Policy Evaluation in Reinforcement Learning

ROLeR: Effective Reward Shaping in Offline Reinforcement Learning for Recommender Systems