Abstract: This paper analyzes reinforcement learning (RL) algorithms for Markov decision processes (MDPs) under the average-reward criterion. We focus on Q-learning algorithms based on relative value iteration (RVI), which are model-free stochastic analogues of the classical RVI method for average-reward MDPs. These algorithms have low per-iteration complexity, making them well-suited for problems with large state spaces. We extend the almost-sure convergence analysis of RVI Q-learning algorithms developed by Abounadi, Bertsekas, and Borkar (2001) from unichain to weakly communicating MDPs. This extension is important both practically and theoretically: weakly communicating MDPs cover a much broader range of applications than unichain MDPs, and their optimality equations have a richer solution structure (with multiple degrees of freedom), which introduces additional complexity in proving algorithmic convergence. We also characterize the sets to which RVI Q-learning algorithms converge, showing that they are compact, connected, potentially nonconvex, and composed of solutions to the average-reward optimality equation, with exactly one less degree of freedom than the general solution set of this equation. Furthermore, we extend our analysis to two RVI-based hierarchical average-reward RL algorithms using the options framework, proving their almost-sure convergence and characterizing their sets of convergence under the assumption that the underlying semi-Markov decision process is weakly communicating.
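For readers unfamiliar with RVI Q-learning, the following is a minimal tabular sketch of the kind of update the abstract refers to, not the paper's own implementation. It assumes the common choice f(Q) = Q(s_ref, a_ref) for the reference function, a uniformly random behavior policy, and 1/n step sizes per state-action pair. The names `rvi_q_update`, `simulate`, `P`, `R`, and `ref` are hypothetical and introduced only for illustration.

```python
import numpy as np


def rvi_q_update(Q, s, a, r, s_next, alpha, ref=(0, 0)):
    """Apply one RVI Q-learning update to the table Q in place.

    Assumption: the reference function is f(Q) = Q[ref], the value of a
    fixed state-action pair. Subtracting it replaces the discounting used
    in standard Q-learning and serves as the estimate of the reward rate.
    """
    target = r + Q[s_next].max() - Q[ref]
    Q[s, a] += alpha * (target - Q[s, a])
    return Q


def simulate(P, R, steps=1000, seed=0, ref=(0, 0)):
    """Run the update along one trajectory of a small simulated MDP.

    P[s, a, s'] holds transition probabilities and R[s, a] holds rewards.
    Both arrays are hypothetical inputs; the algorithm itself only uses
    the sampled transitions, never P directly.
    """
    rng = np.random.default_rng(seed)
    n_states, n_actions = R.shape
    Q = np.zeros((n_states, n_actions))
    visits = np.zeros((n_states, n_actions))
    s = 0
    for _ in range(steps):
        a = int(rng.integers(n_actions))           # uniform random behavior policy
        s_next = int(rng.choice(n_states, p=P[s, a]))
        visits[s, a] += 1
        alpha = 1.0 / visits[s, a]                 # diminishing step size
        rvi_q_update(Q, s, a, R[s, a], s_next, alpha, ref)
        s = s_next
    return Q, Q[ref]                               # Q[ref] is the reward-rate estimate
```

A two-state example in which action 0 stays and action 1 switches states, with reward 1 only for staying in state 1, is communicating but not unichain (the "always stay" policy has two recurrent classes), so it falls in the setting the paper addresses rather than the earlier unichain one.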

An Average-Reward Reinforcement Learning Algorithm Based on Schweitzer's Transformation

Efficient Average Reward Reinforcement Learning Using Constant Shifting Values

Average-Reward Reinforcement Learning with Trust Region Methods

Study on an Average Reward Reinforcement Learning Algorithm

On Convergence of Average-Reward Q-Learning in Weakly Communicating Markov Decision Processes

RVI-SAC: Average Reward Off-Policy Deep Reinforcement Learning

Sharper Model-free Reinforcement Learning for Average-reward Markov Decision Processes

Asynchronous Stochastic Approximation and Average-Reward Reinforcement Learning

An Improved Dyna-Q Algorithm Inspired by the Forward Prediction Mechanism in the Rat Brain for Mobile Robot Path Planning

An Incremental Optimization Approach to Address the Spatiotemporal Reward Coupling Effects in Deep Reinforcement Learning for Path Planning

Relative Q-Learning for Average-Reward Markov Decision Processes with Continuous States

Optimizing the Long-Term Average Reward for Continuing MDPs: A Technical Report

Model-Free Robust Average-Reward Reinforcement Learning

Average-reward model-free reinforcement learning: a systematic review and literature mapping

Off-Policy Deep Reinforcement Learning Based on Steffensen Value Iteration

Multiple Suboptimal Policies Integrated Reinforcement Learning Algorithm for Path Planning

An Efficient Deep Reinforcement Learning Algorithm for Mapless Navigation with Gap-Guided Switching Strategy

Autonomous Learning and Navigation of Mobile Robots Based on Deep Reinforcement Learning

Reward Shaping for Building Trustworthy Robots in Sequential Human-Robot Interaction

On-Policy Deep Reinforcement Learning for the Average-Reward Criterion

Constrained Reinforcement Learning with Average Reward Objective: Model-Based and Model-Free Algorithms