Abstract: The online Markov decision process (MDP) is a generalization of the classical Markov decision process that incorporates changing reward functions. In this paper, we propose practical online MDP algorithms with policy iteration and theoretically establish a sublinear regret bound. A notable advantage of the proposed algorithm is that it can be easily combined with function approximation, and thus large and possibly continuous state spaces can be efficiently handled. Through experiments, we demonstrate the usefulness of the proposed algorithm.

What problem does this paper attempt to address?

This paper addresses policy optimization in the online Markov decision process (online MDP), where the reward function changes over time. Specifically, it aims to develop a new algorithm that solves online MDP problems with large-scale state spaces efficiently while keeping its performance close to that of the optimal offline policy.

### Problem Background

The traditional Markov decision process (MDP) assumes that the reward function is fixed, but in many practical applications the reward function changes over time, so an online learning algorithm that adapts to this change is needed. The goal of the online MDP is to select a policy at each time step so as to maximize the cumulative reward, or equivalently to minimize the regret relative to the optimal offline policy.

### Core Contributions of the Paper

1. **Proposed an online MDP algorithm based on policy iteration (OMDP-PI)**
   - The algorithm updates the current policy through a policy improvement operation at each time step.
   - The algorithm has low time complexity, which makes it suitable for problems with large-scale state spaces.
2. **Theoretical Analysis**
   - The paper proves that OMDP-PI achieves a sublinear regret bound under certain conditions; that is, as the number of time steps \( T \) increases, the regret grows more slowly than linearly.
3. **Extension to Continuous State Space**
   - The paper proposes a variant combined with linear function approximation, which enables the algorithm to handle continuous state spaces.
4. **Experimental Verification**
   - Grid-world experiments demonstrate the effectiveness of the algorithm and support the theoretical analysis.

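The sublinear-regret claim can also be written as a limit. As a sketch, with \( L(T) \) a generic symbol assumed here for the regret after \( T \) steps, sublinear regret means that the average regret per step vanishes:

\[ \lim_{T \to \infty} \frac{L(T)}{T} = 0 \]

For example, any bound of the form \( L(T) \leq a \ln T + b \), with constants \( a, b > 0 \) chosen purely for illustration, is sublinear:

\[ \frac{L(T)}{T} \leq \frac{a \ln T + b}{T} \to 0 \quad (T \to \infty) \]
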
### Mathematical Formulas

The key formulas in the paper are the following:

- **Value Function Update Rule**

  \[ V_t(s) = (1 - \gamma_t) V_{t-1}(s) + \gamma_t V^{\pi_t}_{r_t}(s) \]

  where \(\gamma_t = \frac{1}{t}\) is the step-size parameter.

- **Policy Improvement Operation**

  \[ \pi_t = \Gamma(\hat{r}_{t-1}, V_{t-1}) \]

  where \(\hat{r}_{t-1}(s,a) = \frac{1}{t-1} \sum_{k=1}^{t-1} r_k(s,a)\) is the average of historical rewards.

- **Mathematical Expression of Regret Bound**

  \[ L_{\text{OMDP-PI}}(T) \leq \frac{2 - e^{-1/\tau}}{1 - e^{-1/\tau}} C \xi T C_v + \left( 6\tau\xi \left( \frac{2 - e^{-1/\tau}}{1 - e^{-1/\tau}} \right) + 2\tau^3 \right) \ln T + \left( 6\tau\xi \left( \frac{2 - e^{-1/\tau}}{1 - e^{-1/\tau}} \right) + 2\tau^3 + 2\tau^3 e^{\tau + 2} + 4\tau \right) \]

Together, these formulas underpin the theoretical analysis of the algorithm proposed in the paper.
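
The value-function update and the reward average share one mechanism: with step size \(\gamma_t = \frac{1}{t}\), the recursion is an incremental running mean of the per-step terms. The following is a minimal sketch of that mechanism on a small tabular problem, under several assumptions. The summary does not define \(\Gamma\) or say how \(V^{\pi_t}_{r_t}\) is computed, so `greedy_improvement` uses a one-step greedy lookahead as a hypothetical stand-in for \(\Gamma\), the transition tensor `P` is random, and the per-step value terms are random placeholders. All function and variable names are illustrative and not taken from the paper.

```python
import numpy as np


def running_average_update(prev, new, t):
    """One step of x_t = (1 - gamma_t) * x_{t-1} + gamma_t * new, gamma_t = 1/t."""
    gamma_t = 1.0 / t
    return (1.0 - gamma_t) * prev + gamma_t * new


def greedy_improvement(r_hat, V, P):
    """Hypothetical stand-in for Gamma (an assumption, not the paper's operator).

    r_hat: (S, A) average rewards, V: (S,) values, P: (S, A, S) transitions.
    Returns the action maximizing r_hat(s, a) + sum_s' P(s' | s, a) V(s').
    """
    q = r_hat + P @ V  # (S, A, S) @ (S,) -> (S, A)
    return np.argmax(q, axis=1)


def demo(n_states=4, n_actions=2, horizon=50, seed=0):
    rng = np.random.default_rng(seed)

    # Random transition tensor, normalized so each P[s, a, :] sums to 1.
    P = rng.random((n_states, n_actions, n_states))
    P /= P.sum(axis=2, keepdims=True)

    # Changing rewards r_1..r_T and placeholder targets standing in for
    # V^{pi_t}_{r_t}; both are random here purely for illustration.
    rewards = rng.random((horizon, n_states, n_actions))
    targets = rng.random((horizon, n_states))

    V = np.zeros(n_states)                   # V_0 (assumed zero)
    r_hat = np.zeros((n_states, n_actions))  # r_hat_0 (assumed zero)
    policies = []
    for t in range(1, horizon + 1):
        # pi_t depends only on quantities available after step t - 1.
        policies.append(greedy_improvement(r_hat, V, P))
        # With gamma_t = 1/t, both updates are incremental means.
        V = running_average_update(V, targets[t - 1], t)
        r_hat = running_average_update(r_hat, rewards[t - 1], t)
    return V, r_hat, targets, rewards, policies
```

Calling `demo()` returns the final `V` and `r_hat`, which agree with `targets.mean(axis=0)` and `rewards.mean(axis=0)` up to floating-point error. This is why the step size \(\frac{1}{t}\) weights every past step equally.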

Online Markov decision processes with policy iteration

Online Reinforcement Learning in Markov Decision Process Using Linear Programming

Dynamic Regret of Online Markov Decision Processes

Online Markov Decision Processes with Non-Oblivious Strategic Adversary

Blackwell Online Learning for Markov Decision Processes

Online Policy Optimization for Robust MDP

Robust Batch Policy Learning in Markov Decision Processes

Efficient Policy Iteration for Robust Markov Decision Processes via Regularization

Acting in Delayed Environments with Non-Stationary Markov Policies

√N-Regret for Learning in Markov Decision Processes with Function Approximation and Low Bellman Rank.

Approximate Policy Iteration for Robust Stochastic Control of Multi-agent Markov Decision Processes

A Theoretical Analysis of Optimistic Proximal Policy Optimization in Linear Markov Decision Processes

Fast Online Exact Solutions for Deterministic MDPs with Sparse Rewards

Robust Anytime Learning of Markov Decision Processes

A Provably Efficient Algorithm for Linear Markov Decision Process with Low Switching Cost

Narrowing the Gap between Adversarial and Stochastic MDPs via Policy Optimization

Optimal Policies for Quantum Markov Decision Processes

Bayesian Learning of Optimal Policies in Markov Decision Processes with Countably Infinite State-Space

Robust Average-Reward Markov Decision Processes

Online Reinforcement Learning for Real-Time Exploration in Continuous State and Action Markov Decision Processes