Abstract: This paper analyzes reinforcement learning (RL) algorithms for Markov decision processes (MDPs) under the average-reward criterion. We focus on Q-learning algorithms based on relative value iteration (RVI), which are model-free stochastic analogues of the classical RVI method for average-reward MDPs. These algorithms have low per-iteration complexity, making them well-suited for problems with large state spaces. We extend the almost-sure convergence analysis of RVI Q-learning algorithms developed by Abounadi, Bertsekas, and Borkar (2001) from unichain to weakly communicating MDPs. This extension is important both practically and theoretically: weakly communicating MDPs cover a much broader range of applications than unichain MDPs, and their optimality equations have a richer solution structure (with multiple degrees of freedom), which introduces additional complexity in proving algorithmic convergence. We also characterize the sets to which RVI Q-learning algorithms converge, showing that they are compact, connected, potentially nonconvex, and composed of solutions to the average-reward optimality equation, with exactly one fewer degree of freedom than the general solution set of this equation. Furthermore, we extend our analysis to two RVI-based hierarchical average-reward RL algorithms using the options framework, proving their almost-sure convergence and characterizing their sets of convergence under the assumption that the underlying semi-Markov decision process is weakly communicating.
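
As a rough illustration of the kind of update the abstract refers to, the following is a minimal tabular sketch of an RVI Q-learning iteration in the style of Abounadi, Bertsekas, and Borkar (2001). The two-state MDP, the function name rvi_q_learning, the reference pair ref, the uniform random behaviour policy, and the 1/n step size are all assumptions chosen for illustration; none of them is taken from the paper.

```python
import numpy as np

# Hypothetical toy MDP (an assumption for illustration, not from the paper).
# Action 0 = stay, action 1 = switch state. The MDP is communicating but not
# unichain: the "always stay" policy has two recurrent classes, {0} and {1}.
P = np.zeros((2, 2, 2))          # P[s, a, s'] = transition probability
P[0, 0] = [1.0, 0.0]             # stay in state 0
P[0, 1] = [0.0, 1.0]             # switch 0 -> 1
P[1, 0] = [0.0, 1.0]             # stay in state 1
P[1, 1] = [1.0, 0.0]             # switch 1 -> 0
R = np.array([[0.0, 0.0],        # R[s, a] = reward; only staying in
              [1.0, 0.0]])       # state 1 pays off


def rvi_q_learning(P, R, steps=100_000, seed=0, ref=(0, 0)):
    """Tabular RVI Q-learning sketch with f(Q) = Q[ref] as reference function."""
    rng = np.random.default_rng(seed)
    n_states, n_actions = R.shape
    Q = np.zeros((n_states, n_actions))
    visits = np.zeros((n_states, n_actions))
    s = 0
    for _ in range(steps):
        a = rng.integers(n_actions)               # uniform random exploration
        s_next = rng.choice(n_states, p=P[s, a])  # sample the next state
        visits[s, a] += 1
        alpha = 1.0 / visits[s, a]                # assumed 1/n step size per pair
        # Relative update: subtract f(Q) = Q[ref] instead of discounting.
        td = R[s, a] + Q[s_next].max() - Q[ref] - Q[s, a]
        Q[s, a] += alpha * td
        s = s_next
    # Q[ref] doubles as the running estimate of the optimal average reward.
    return Q, float(Q[ref])


Q, gain = rvi_q_learning(P, R)
print("Q estimate:\n", Q)
print("gain estimate:", gain)
```

In this toy problem the optimal average reward is 1 (move to state 1 and stay there), so the gain estimate should settle near 1; how close it gets depends on the number of steps and the seed. The choice f(Q) = Q[ref] is only one possible reference function, and a different choice would generally lead to a different limit point within the solution set of the optimality equation.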

RVI reinforcement learning for semi-Markov decision processes with average reward

On Convergence of Average-Reward Q-Learning in Weakly Communicating Markov Decision Processes

Study on an Average Reward Reinforcement Learning Algorithm

Model-Free Robust Average-Reward Reinforcement Learning

Asynchronous Stochastic Approximation and Average-Reward Reinforcement Learning

Sharper Model-free Reinforcement Learning for Average-reward Markov Decision Processes

RVI-SAC: Average Reward Off-Policy Deep Reinforcement Learning

Provably Efficient Infinite-Horizon Average-Reward Reinforcement Learning with Linear Function Approximation

Relative Q-Learning for Average-Reward Markov Decision Processes with Continuous States

Hierarchical Average-Reward Linearly-solvable Markov Decision Processes

Average-Reward Reinforcement Learning with Trust Region Methods

Burning RED: Unlocking Subtask-Driven Reinforcement Learning and Risk-Awareness in Average-Reward Markov Decision Processes

Beyond discounted returns: Robust Markov decision processes with average and Blackwell optimality

Mean-Semivariance Policy Optimization via Risk-Averse Reinforcement Learning

Robust Risk-Sensitive Reinforcement Learning with Conditional Value-at-Risk

Semi-Infinitely Constrained Markov Decision Processes and Provably Efficient Reinforcement Learning

Efficient Average Reward Reinforcement Learning Using Constant Shifting Values

Semi-Infinitely Constrained Markov Decision Processes and Efficient Reinforcement Learning

Optimizing the Long-Term Average Reward for Continuing MDPs: A Technical Report

Robust Average-Reward Markov Decision Processes

Constrained Reinforcement Learning with Average Reward Objective: Model-Based and Model-Free Algorithms