Abstract: In reinforcement learning (RL), one of the key components is policy evaluation, which aims to estimate the value function (i.e., the expected long-term accumulated reward) of a policy. With a good policy evaluation method, RL algorithms estimate the value function more accurately and find a better policy. When the state space is large or continuous, \emph{Gradient-based Temporal Difference (GTD)} policy evaluation algorithms with linear function approximation are widely used. Because collecting evaluation data is costly in both time and reward, a clear understanding of the finite sample performance of policy evaluation algorithms is very important to reinforcement learning. Under the assumption that the data are generated i.i.d., previous work provided a finite sample analysis of the GTD algorithms with constant step size by converting them into convex-concave saddle point problems. However, it is well known that in RL problems the data are generated from Markov processes rather than i.i.d. In this paper, in the realistic Markov setting, we derive finite sample bounds for general convex-concave saddle point problems, and hence for the GTD algorithms. Our bounds lead to the following observations. (1) Under several step size variants, the GTD algorithms converge. (2) The convergence rate is determined by the step size, with the mixing time of the Markov process as the coefficient: the faster the Markov process mixes, the faster the convergence. (3) The experience replay trick is effective because it improves the mixing property of the Markov process. To the best of our knowledge, our analysis is the first to provide finite sample bounds for the GTD algorithms in the Markov setting.
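
For readers unfamiliar with the saddle point reduction mentioned in the abstract, the block below recalls how it is commonly written for GTD2 in the earlier literature. It is a sketch under assumed notation, none of which appears in the abstract itself: $\phi_t$ and $\phi'_t$ denote the feature vectors of the current and next state, $r_t$ the reward, $\gamma$ the discount factor, $\theta$ the value function weights, $y$ the auxiliary (dual) variable, and $\alpha_t, \beta_t$ the step sizes.

```latex
% Assumed notation (not from the abstract): phi_t, phi'_t, r_t, gamma, theta, y, alpha_t, beta_t.
\begin{align*}
A &= \mathbb{E}\big[\phi_t(\phi_t - \gamma\phi'_t)^{\top}\big], \qquad
b = \mathbb{E}[r_t\phi_t], \qquad
M = \mathbb{E}\big[\phi_t\phi_t^{\top}\big], \\
% Convex-concave objective: linear in theta, strongly concave in y when M is positive definite.
L(\theta, y) &= \langle b - A\theta,\, y\rangle - \tfrac{1}{2}\|y\|_{M}^{2}, \qquad
\max_{y} L(\theta, y) = \tfrac{1}{2}\|b - A\theta\|_{M^{-1}}^{2}, \\
% Stochastic gradient ascent in y and descent in theta, using one sample per step.
\delta_t &= r_t + \gamma\,\theta_t^{\top}\phi'_t - \theta_t^{\top}\phi_t, \\
y_{t+1} &= y_t + \beta_t\,(\delta_t - \phi_t^{\top}y_t)\,\phi_t, \\
\theta_{t+1} &= \theta_t + \alpha_t\,(\phi_t - \gamma\phi'_t)\,(\phi_t^{\top}y_t).
\end{align*}
```

In this form the two updates are stochastic gradient steps on $L$ using single-sample estimates of $A$, $b$, and $M$. In the Markov setting those samples are consecutive transitions of one trajectory rather than independent draws, which is where the mixing time enters the bounds.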

Performance Bounds for Policy-Based Average Reward Reinforcement Learning Algorithms

On-Policy Deep Reinforcement Learning for the Average-Reward Criterion

Average-Reward Reinforcement Learning with Trust Region Methods

Learning Fair Policies in Multi-Objective (deep) Reinforcement Learning with Average and Discounted Rewards

Sharper Model-free Reinforcement Learning for Average-reward Markov Decision Processes

Average Reward Adjusted Discounted Reinforcement Learning: Near-Blackwell-Optimal Policies for Real-World Applications

Reinforcement learning algorithms for semi-Markov decision processes with average reward

On the Global Convergence of Policy Gradient in Average Reward Markov Decision Processes

Reinforcement Learning for Infinite-Horizon Average-Reward Linear MDPs via Approximation by Discounted-Reward MDPs

Offline Primal-Dual Reinforcement Learning for Linear MDPs

Constrained Reinforcement Learning with Average Reward Objective: Model-Based and Model-Free Algorithms

Provably Efficient Reinforcement Learning for Infinite-Horizon Average-Reward Linear MDPs

Variance-Reduced Policy Gradient Approaches for Infinite Horizon Average Reward Markov Decision Processes

Policy Zooming: Adaptive Discretization-based Infinite-Horizon Average-Reward Reinforcement Learning

Hierarchical Average Reward Policy Gradient Algorithms

Finite Sample Analysis of the GTD Policy Evaluation Algorithms in Markov Setting

Efficient Average Reward Reinforcement Learning Using Constant Shifting Values

Provable Policy Gradient Methods for Average-Reward Markov Potential Games

Batch Policy Learning in Average Reward Markov Decision Processes

Performance of NPG in Countable State-Space Average-Cost RL

On the Performance Bounds of some Policy Search Dynamic Programming Algorithms