On the Curses of Future and History in Future-dependent Value Functions for Off-policy Evaluation

Yuheng Zhang, Nan Jiang
2024-10-03
Abstract: We study off-policy evaluation (OPE) in partially observable environments with complex observations, with the goal of developing estimators whose guarantee avoids exponential dependence on the horizon. While such estimators exist for MDPs and POMDPs can be converted to history-based MDPs, their estimation errors depend on the state-density ratio for MDPs which becomes history ratios after conversion, an exponential object. Recently, Uehara et al. [2022a] proposed future-dependent value functions as a promising framework to address this issue, where the guarantee for memoryless policies depends on the density ratio over the latent state space. However, it also depends on the boundedness of the future-dependent value function and other related quantities, which we show could be exponential-in-length and thus erasing the advantage of the method. In this paper, we discover novel coverage assumptions tailored to the structure of POMDPs, such as outcome coverage and belief coverage, which enable polynomial bounds on the aforementioned quantities. As a side product, our analyses also lead to the discovery of new algorithms with complementary properties.
Machine Learning, Artificial Intelligence
What problem does this paper attempt to address?
This paper addresses the exponential dependence on the horizon that arises in off-policy evaluation (OPE) in partially observable environments, modeled as partially observable Markov decision processes (POMDPs). Specifically, it asks how to keep the estimation error from growing exponentially with the horizon when observations are complex.

### Main problems and background

Off-policy evaluation (OPE) uses historical data, usually collected by a different behavior policy, to estimate the performance of a new policy. The problem is central to reinforcement learning but hard in practice. Standard methods such as importance sampling (IS) and Fitted-Q Evaluation (FQE) are widely used, but in POMDPs their estimation errors can grow exponentially with the horizon, so they perform poorly on long-horizon problems. In particular, converting a POMDP into a history-based MDP turns the state-density ratio that governs MDP guarantees into a ratio over histories, which is an exponential object.

### Core contributions of the paper

1. **Future-Dependent Value Function (FDVF)**: The paper builds on the future-dependent value function (FDVF) framework of Uehara et al. [2022a], which was proposed to address the exponential dependence. An FDVF takes future observations as input rather than the full history, and its guarantee for memoryless policies depends on the density ratio over the latent state space. However, the guarantee also depends on hard-to-interpret assumptions, namely the boundedness of the FDVF and of other related quantities. The paper shows that these quantities can be exponential in the horizon, which would erase the advantage of the method.
2. **New coverage assumptions**: To bound the FDVF and the related quantities, the authors introduce coverage assumptions tailored to the structure of POMDPs:
   - **Outcome Coverage**: Requires that the behavior policy πb, run from the current time step onward, sufficiently overlaps with the evaluation policy πe.
   - **Belief Coverage**: Requires that the belief-state distribution under the behavior policy πb covers the average belief state under the evaluation policy πe.
3. **Improved algorithm**: As a side product of the analysis under these coverage assumptions, the authors derive new algorithms with complementary properties, whose guarantees are polynomial in the horizon and avoid explicit dependence on the size of the latent state space.

### Formula summary

- **Definition of the future-dependent value function**: an FDVF \( V_F \) is any function of the future \( f_h \) that satisfies
  \[ \mathbb{E}_{\pi_b}[V_F(f_h) \mid s_h] = V_{\pi_e}^S(s_h) \]
  where \( V_{\pi_e}^S(s_h) \) is the value function of \( \pi_e \) at the latent state \( s_h \).
- **L2 outcome coverage assumption**:
  \[ \|\mathbf{V}_{\pi_e}^S(h)\|_{2,\Sigma_F^{-1}}\leq C_{F,V} \]
- **L∞ outcome coverage assumption**:
  \[ \|(\Sigma_R^{F,h})^{-1}\mathbf{V}_{\pi_e}^S\|_\infty\leq C_{F,\infty} \]
- **Belief coverage assumption**:
  \[ \|b_{\pi_e}^h\|_{2,\Sigma_H^{-1}}\leq C_{H,2} \]

Under these assumptions, the paper obtains polynomial bounds on the quantities above, which removes the exponential dependence on the horizon from the FDVF guarantee for OPE in POMDPs and provides new tools for future research.
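
To make the boundedness concern concrete, here is a minimal tabular sketch, not the authors' algorithm. It assumes the FDVF is defined by the conditional-moment condition E[V_F(f_h) | s_h] = V(s_h), as in Uehara et al.'s framework, and simplifies the future f_h to a single observation, so the condition becomes a linear system in the emission matrix. The two-state, two-observation setup, the function names `solve_fdvf` and `conditional_mean`, and all numbers are hypothetical and chosen only for illustration.

```python
def solve_fdvf(emission, v_latent):
    """Solve sum_o emission[s][o] * v_future[o] = v_latent[s] for a 2x2 system.

    emission[s][o] is an assumed probability of observing o in latent state s;
    v_latent[s] is an assumed latent-state value of the evaluation policy.
    """
    (a, b), (c, d) = emission
    det = a * d - b * c
    if abs(det) < 1e-12:
        raise ValueError("emission matrix is singular: no unique solution")
    # Cramer's rule for the 2x2 linear system.
    v0 = (v_latent[0] * d - b * v_latent[1]) / det
    v1 = (a * v_latent[1] - c * v_latent[0]) / det
    return [v0, v1]


def conditional_mean(emission, v_future):
    """Return E[v_future(o) | s] for each latent state s."""
    return [sum(p * v for p, v in zip(row, v_future)) for row in emission]


# Hypothetical latent values and two hypothetical emission models.
v_latent = [1.0, 0.0]
informative = [[0.9, 0.1], [0.1, 0.9]]    # observations reveal the state well
noisy = [[0.51, 0.49], [0.49, 0.51]]      # observations are nearly uninformative

v_informative = solve_fdvf(informative, v_latent)
v_noisy = solve_fdvf(noisy, v_latent)

for name, emission, v_future in [
    ("informative", informative, v_informative),
    ("noisy", noisy, v_noisy),
]:
    # The conditional mean should reproduce v_latent in both cases,
    # but the magnitude of v_future differs sharply.
    print(name, v_future, conditional_mean(emission, v_future))
```

In both cases the solution reproduces the latent values in conditional expectation, but its largest entry grows from about 1.1 to about 25.5 as the observations become less informative. This is only a toy analogue of the issue the paper studies: a valid FDVF can exist and still be very large, which is why its boundedness has to be controlled by an assumption rather than taken for granted.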