Online Markov decision processes with policy iteration

Yao Ma, Hao Zhang, Masashi Sugiyama
DOI: https://doi.org/10.48550/arXiv.1510.04454
2015-10-15
Abstract: The online Markov decision process (MDP) is a generalization of the classical Markov decision process that incorporates changing reward functions. In this paper, we propose practical online MDP algorithms with policy iteration and theoretically establish a sublinear regret bound. A notable advantage of the proposed algorithm is that it can be easily combined with function approximation, and thus large and possibly continuous state spaces can be efficiently handled. Through experiments, we demonstrate the usefulness of the proposed algorithm.
Machine Learning
What problem does this paper attempt to address?
The paper addresses policy optimization in the online Markov decision process (online MDP), where the reward function changes over time. Specifically, it aims to develop an algorithm that efficiently solves online MDP problems with large state spaces while keeping its performance close to that of the optimal offline policy.

### Problem Background

The classical Markov decision process (MDP) assumes a fixed reward function, but in many practical applications the reward function changes over time, so an online learning algorithm that adapts to these changes is needed. The goal in an online MDP is to select a policy at each time step to maximize the cumulative reward and minimize the regret relative to the optimal offline policy.

### Core Contributions of the Paper

1. **Online MDP algorithm based on policy iteration (OMDP-PI)**
   - The algorithm updates the current policy through a policy improvement operation at each time step.
   - Its time complexity is low, which makes it suitable for problems with large state spaces.
2. **Theoretical analysis**
   - The paper proves that OMDP-PI achieves a sublinear regret bound under certain conditions; that is, as the number of time steps \( T \) increases, the regret grows more slowly than linearly in \( T \).
3. **Extension to continuous state spaces**
   - The paper proposes a variant combined with linear function approximation, which enables the algorithm to handle continuous state spaces.
4. **Experimental verification**
   - Grid-world experiments demonstrate the effectiveness of the algorithm, with results consistent with the theoretical analysis.
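As a worked illustration of what "sublinear regret" means in the theoretical analysis item above, here is the property in generic notation. The symbols \( f \), \( a \), and \( b \) are assumptions introduced only for this illustration and are not taken from the paper. A regret bound \( L(T) \leq f(T) \) is sublinear when

\[ \lim_{T \to \infty} \frac{f(T)}{T} = 0, \]

so the per-step loss relative to the optimal offline policy is bounded by a quantity that vanishes as \( T \) grows. For instance, a purely logarithmic bound \( f(T) = a \ln T + b \) with constants \( a, b \geq 0 \) satisfies

\[ \frac{a \ln T + b}{T} \to 0 \quad \text{as } T \to \infty. \]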
### Mathematical Formulas

The key formulas from the paper are as follows:

- **Value function update rule**
  \[ V_t(s) = (1 - \gamma_t) V_{t-1}(s) + \gamma_t V^{\pi_t}_{r_t}(s) \]
  where \(\gamma_t = \frac{1}{t}\) is the step-size parameter.
- **Policy improvement operation**
  \[ \pi_t = \Gamma(\hat{r}_{t-1}, V_{t-1}) \]
  where \(\hat{r}_{t-1}(s,a) = \frac{1}{t-1}\sum_{k=1}^{t-1} r_k(s,a)\) is the average of the historical rewards.
- **Regret bound**
  \[ L_{\text{OMDP-PI}}(T) \leq \frac{2 - e^{-1/\tau}}{1 - e^{-1/\tau}} C \xi T C_v + \left(6\tau\xi\left(\frac{2 - e^{-1/\tau}}{1 - e^{-1/\tau}}\right) + 2\tau^3\right)\ln T + \left(6\tau\xi\left(\frac{2 - e^{-1/\tau}}{1 - e^{-1/\tau}}\right) + 2\tau^3 + 2\tau^3 e^{\tau + 2} + 4\tau\right) \]

Together, these formulas and the theoretical analysis support the effectiveness of the proposed algorithm.
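The value function update and the reward average above share one structure: with step size \(\gamma_t = \frac{1}{t}\), the recursion is an incremental arithmetic mean. The following minimal Python sketch checks this numerically. It is not the paper's implementation: the random vectors stand in for \(V^{\pi_t}_{r_t}\) and \(r_t\), and the names `running_average_update`, `value_targets`, and `reward_tables` are hypothetical. The policy improvement operator \(\Gamma\) is not defined in this summary, so it is left out.

```python
import numpy as np


def running_average_update(prev, new, t):
    """One step of x_t = (1 - gamma_t) * x_{t-1} + gamma_t * new, with gamma_t = 1/t."""
    gamma_t = 1.0 / t
    return (1.0 - gamma_t) * prev + gamma_t * new


rng = np.random.default_rng(0)
n_states, n_actions, horizon = 4, 2, 50

# Hypothetical stand-ins (assumptions, not from the paper):
# random vectors in place of V^{pi_t}_{r_t}, random tables in place of r_t.
value_targets = [rng.normal(size=n_states) for _ in range(horizon)]
reward_tables = [rng.uniform(size=(n_states, n_actions)) for _ in range(horizon)]

V = np.zeros(n_states)                   # V_0; irrelevant because gamma_1 = 1
r_hat = np.zeros((n_states, n_actions))  # running reward average
for t in range(1, horizon + 1):
    V = running_average_update(V, value_targets[t - 1], t)
    r_hat = running_average_update(r_hat, reward_tables[t - 1], t)

# With gamma_t = 1/t the recursion reproduces the plain arithmetic mean.
assert np.allclose(V, np.mean(value_targets, axis=0))
assert np.allclose(r_hat, np.mean(reward_tables, axis=0))
```

The incremental form keeps only the current average in memory rather than the full history, which is why a recursion of this kind is convenient in an online setting.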