Robust Offline Reinforcement Learning with Heavy-Tailed Rewards

Jin Zhu, Runzhe Wan, Zhengling Qi, Shikai Luo, Chengchun Shi
2024-03-31
Abstract: This paper endeavors to augment the robustness of offline reinforcement learning (RL) in scenarios laden with heavy-tailed rewards, a prevalent circumstance in real-world applications. We propose two algorithmic frameworks, ROAM and ROOM, for robust off-policy evaluation and offline policy optimization (OPO), respectively. Central to our frameworks is the strategic incorporation of the median-of-means method with offline RL, enabling straightforward uncertainty estimation for the value function estimator. This not only adheres to the principle of pessimism in OPO but also adeptly manages heavy-tailed rewards. Theoretical results and extensive experiments demonstrate that our two frameworks outperform existing methods when the logged dataset exhibits heavy-tailed reward distributions. The implementation of the proposal is available at <a class="link-external link-https" href="https://github.com/Mamba413/ROOM" rel="external noopener nofollow">this https URL</a>.
Machine Learning, Artificial Intelligence
What problem does this paper attempt to address?
This paper addresses how to make offline reinforcement learning (offline RL) robust when the reward distribution is heavy-tailed. Heavy-tailed rewards are common in real-world applications and pose a serious challenge to existing offline RL methods. Specifically:

1. **Off-Policy Evaluation (OPE)**: The goal of OPE is to evaluate the value of a new policy using only historical data. Standard regression methods are very sensitive to heavy-tailed rewards, which slows the convergence of the value estimator and degrades the accuracy of policy evaluation.
2. **Offline Policy Optimization (OPO)**: Heavy-tailed rewards exacerbate the overestimation problem in standard RL algorithms. In a multi-armed bandit, for example, the estimate of each arm's expected reward has large variance, which raises the probability of selecting a suboptimal arm and undermines policy optimization.

To address these problems, the authors propose two frameworks: ROAM for off-policy evaluation and ROOM for offline policy optimization. The core idea of both is to bring the median-of-means (MM) method from robust statistics into offline RL, which handles heavy-tailed rewards and provides a straightforward way to quantify uncertainty. Theoretical analysis and extensive experiments indicate that these frameworks outperform existing methods when rewards are heavy-tailed.

### Formula Summary

- **Median-of-Means Estimator**: \[ \text{MM}=\text{Median}\left(\left\{\frac{1}{|B_k|} \sum_{i \in B_k} R_i\right\}_{k = 1}^K\right) \] where \(R_i\) is the observed heavy-tailed reward, and \(B_k\) is the \(k\)-th of the \(K\) subsets into which the data are divided.
- **Policy Value Estimation in the ROAM Algorithm**: \[ \hat{J}_\pi=\mathbb{E}_{s\sim G, a\sim\pi(\cdot|s)}\left[\text{Median}\left(\left\{\hat{Q}_\pi^k(s, a)\right\}_{k = 1}^K\right)\right] \] where \(\hat{Q}_\pi^k(s, a)\) is the Q-function of policy \(\pi\) estimated from the \(k\)-th data subset.
- **Policy Optimization in the ROOM Algorithm**: \[ \hat{\pi}^*(s)=\arg\max_a\text{Median}\left(\left\{\hat{Q}^{*k}(s, a)\right\}_{k = 1}^K\right) \] where \(\hat{Q}^{*k}(s, a)\) is the optimal Q-function estimated from the \(k\)-th data subset.

By introducing the median-of-means method, these frameworks improve the robustness of offline RL under heavy-tailed rewards and also support uncertainty quantification, which helps in high-risk application scenarios.
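
The formulas above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation (that lives in the linked repository). The function names `median_of_means` and `median_greedy_actions`, the `num_blocks` and `seed` parameters, and the tabular array layout `(K, num_states, num_actions)` are all assumptions made for illustration. The first function computes the MM estimator on a vector of rewards. The second takes \(K\) Q-tables, assumed to have been fitted separately on the \(K\) data subsets, and returns the action maximizing their pointwise median, in the spirit of the ROOM formula.

```python
import numpy as np


def median_of_means(rewards, num_blocks, seed=0):
    """Median-of-means estimate of the mean of `rewards`.

    The data are shuffled, split into `num_blocks` nearly equal parts,
    and the median of the per-block sample means is returned.
    """
    rewards = np.asarray(rewards, dtype=float)
    if not 1 <= num_blocks <= rewards.size:
        raise ValueError("num_blocks must be between 1 and len(rewards)")
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(rewards)
    blocks = np.array_split(shuffled, num_blocks)  # K non-empty blocks
    block_means = np.array([block.mean() for block in blocks])
    return float(np.median(block_means))


def median_greedy_actions(q_tables):
    """Greedy action per state with respect to the median of K Q-tables.

    `q_tables` is assumed to have shape (K, num_states, num_actions),
    one table per data subset (hypothetical layout for this sketch).
    """
    q_tables = np.asarray(q_tables, dtype=float)
    median_q = np.median(q_tables, axis=0)  # shape (num_states, num_actions)
    return median_q.argmax(axis=1)


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    # Student-t rewards with few degrees of freedom have heavy tails.
    rewards = rng.standard_t(df=2.5, size=1000)
    print("sample mean:    ", rewards.mean())
    print("median of means:", median_of_means(rewards, num_blocks=10))

    # Three Q-tables for one state and two actions; the third table
    # contains an outlying estimate for action 1.
    q_tables = [[[1.0, 0.0]], [[1.0, 0.0]], [[0.0, 100.0]]]
    print("median-greedy action:", median_greedy_actions(q_tables))
```

In the toy Q-table example, averaging the three tables would favor action 1 because of the single outlying estimate, whereas the median ignores it. This is the mechanism by which the median aggregation limits the influence of heavy-tailed rewards.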