Abstract: Motivated by engineering applications such as resource allocation in networks and inventory systems, we consider average-reward Reinforcement Learning with unbounded state space and reward function. Recent works studied this problem in the actor-critic framework and established finite-sample bounds assuming access to a critic with certain error guarantees. We complement their work by studying Temporal Difference (TD) learning with linear function approximation and establishing finite-time bounds with the optimal $\mathcal{O}\left(1/\epsilon^2\right)$ sample complexity. These results are obtained using the following general-purpose theorem for non-linear Stochastic Approximation (SA). Suppose that one constructs a Lyapunov function for a non-linear SA with a certain drift condition. Then, our theorem establishes finite-time bounds when this SA is driven by unbounded Markovian noise under suitable conditions. It serves as a black-box tool to generalize sample guarantees on SA from the i.i.d. or martingale difference case to potentially unbounded Markovian noise. The generality and the mild assumptions of the setup enable broad applicability of our theorem. We illustrate its power by studying two more systems: (i) We improve upon the finite-time bounds of $Q$-learning by tightening the error bounds and also allowing for a larger class of behavior policies. (ii) We establish the first ever finite-time bounds for distributed stochastic optimization of a high-dimensional smooth strongly convex function using cyclic block coordinate descent.
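
The TD learning setting described in the abstract can be illustrated with a minimal sketch of average-reward TD(0) with linear function approximation on a toy queue whose state and reward are both unbounded. Everything concrete here is an assumption made for illustration and is not taken from the paper: the birth-death chain and its probabilities, the feature map `features`, the step-size schedule, and the function names.

```python
import numpy as np

def features(s):
    # Hypothetical unbounded features: scaled linear and quadratic terms.
    return np.array([s / 10.0, s * s / 100.0])

def step(s, rng, p_up=0.3, p_down=0.5):
    # Toy birth-death chain on {0, 1, 2, ...} (assumed, not from the paper).
    u = rng.random()
    if u < p_up:
        return s + 1
    if u < p_up + p_down and s > 0:
        return s - 1
    return s

def average_reward_td0(num_steps=200_000, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.zeros(2)   # weights of the linear value approximation
    r_bar = 0.0           # running estimate of the average reward
    s = 0
    for k in range(num_steps):
        s_next = step(s, rng)
        r = -float(s)                     # unbounded reward: negative queue length
        r_bar += (r - r_bar) / (k + 1)    # running sample mean of the rewards
        phi, phi_next = features(s), features(s_next)
        delta = r - r_bar + theta @ phi_next - theta @ phi  # TD error
        alpha = 50.0 / (k + 1000.0)       # assumed diminishing step size
        theta = theta + alpha * delta * phi
        s = s_next
    return theta, r_bar

theta, r_bar = average_reward_td0()
print("theta:", theta, "average-reward estimate:", r_bar)
```

For this particular chain the stationary distribution is geometric with ratio 0.3/0.5, so the stationary mean queue length is 1.5 and the average-reward estimate should settle near -1.5.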

What problem does this paper attempt to address?

### Problems the paper attempts to solve

This paper addresses the finite-sample performance of average-reward Reinforcement Learning (RL) with unbounded state spaces and reward functions. Specifically, it focuses on two main challenges:

1. **Unbounded state spaces and reward functions**: Most of the existing literature assumes that the state space is finite or the reward is bounded when analyzing RL algorithms. However, in practical applications such as resource allocation in networks and inventory systems, the state space may be infinite, and the rewards may grow without bound as the state increases. Existing theoretical results therefore do not apply directly to these scenarios.
2. **Finite-sample complexity**: Although the asymptotic convergence of RL algorithms can be proved in some cases, their performance within a finite time (i.e., finite-sample complexity) is not fully understood in the unbounded setting. Finite-sample complexity is crucial for evaluating the performance of RL algorithms in practical applications.

To address these problems, the paper studies Temporal Difference (TD) learning with linear function approximation and establishes finite-time convergence bounds under unbounded Markov noise. It also proposes a general theorem for nonlinear Stochastic Approximation (SA) driven by unbounded Markov noise. This theorem can be used as a black-box tool to generalize SA results from the independent and identically distributed (i.i.d.) or martingale difference case to potentially unbounded Markov noise.

### Main contributions

1. **Performance of TD learning under unbounded state spaces and rewards**:
   - Analyzes the average-reward TD(λ) algorithm with linear function approximation under asynchronous updates.
   - Establishes the first known finite-time convergence bound and shows the optimal $O\left(\frac{1}{k}\right)$ convergence rate, which yields a sample complexity of $O\left(\frac{1}{\epsilon^2}\right)$ when the step size is chosen appropriately.
   - Also proves almost sure (a.s.) convergence by projecting the iterates onto an appropriate subspace.
2. **Finite-time convergence guarantee of SA with unbounded Markov noise**:
   - Proposes a general theorem for nonlinear SA driven by unbounded Markov noise.
   - Given a Lyapunov function that satisfies a suitable drift condition, the theorem extends SA guarantees from the i.i.d. or martingale difference case to unbounded Markov noise, which broadens the scope of existing results.
3. **Methodological contributions**:
   - Uses the solution of the Poisson equation to analyze Markov noise instead of relying on the geometric mixing properties of Markov chains.
   - This approach gives tighter bounds (in terms of logarithmic factors) and covers a larger class of Markov chains, such as periodic chains.
4. **Performance of the Q-learning algorithm**:
   - Considers Q-learning in the discounted setting with finite-state Markov noise.
   - Obtains finite-sample bounds for Q-learning by applying the theorem as a black box, tightening the existing error bounds and allowing a larger class of behavior policies.
5. **Performance of Stochastic Cyclic Block Coordinate Descent (CBCD)**:
   - Studies distributed stochastic optimization of high-dimensional smooth and strongly convex functions.
   - Handles the periodicity of the updates by treating each block as a state of a periodic Markov chain, which gives the optimal $O\left(\frac{1}{k}\right)$ convergence rate.

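
The cyclic block coordinate descent setting in the last contribution can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's algorithm or analysis: the quadratic objective, the Gaussian gradient noise, the step-size schedule, and the name `stochastic_cbcd` are all hypothetical. The block index `k % num_blocks` plays the role of the deterministic periodic chain mentioned above.

```python
import numpy as np

def stochastic_cbcd(A, b, num_blocks, num_steps=20_000, noise_std=0.1, seed=0):
    """Noisy cyclic block coordinate descent on f(x) = 0.5 x^T A x - b^T x."""
    rng = np.random.default_rng(seed)
    d = b.shape[0]
    blocks = np.array_split(np.arange(d), num_blocks)
    x = np.zeros(d)
    for k in range(num_steps):
        idx = blocks[k % num_blocks]        # cyclic (periodic) block selection
        grad_block = A[idx] @ x - b[idx]    # exact partial gradient for this block
        noisy_grad = grad_block + noise_std * rng.standard_normal(idx.size)
        alpha = 5.0 / (k + 25.0)            # assumed diminishing step size
        x[idx] -= alpha * noisy_grad        # only the active block is updated
    return x

# Assumed test problem: a tridiagonal, symmetric positive definite matrix.
d = 6
A = 2.0 * np.eye(d) - 0.5 * (np.eye(d, k=1) + np.eye(d, k=-1))
b = np.ones(d)
x_hat = stochastic_cbcd(A, b, num_blocks=3)
x_star = np.linalg.solve(A, b)
print("distance to minimizer:", np.linalg.norm(x_hat - x_star))
```

The eigenvalues of this `A` lie between 1 and 3, so the objective is smooth and strongly convex and the initial step size of 0.2 is small enough for the iteration to stay stable.
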
### Summary

This paper establishes finite-sample guarantees for average-reward reinforcement learning with unbounded state spaces and reward functions by introducing new analytical tools. These results fill a theoretical gap and provide a theoretical basis for practical applications.
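
One of the tools mentioned above, the Poisson equation, can be illustrated numerically on a small periodic chain, where arguments based on geometric mixing do not apply. The sketch below is an assumption-laden illustration for a finite chain only: the sign convention $V - PV = f - \pi(f)$, the three-state cyclic chain, and the function names are choices made here, not taken from the paper.

```python
import numpy as np

def stationary_distribution(P):
    # Solve pi^T P = pi^T together with sum(pi) = 1 in the least-squares sense.
    n = P.shape[0]
    M = np.vstack([P.T - np.eye(n), np.ones((1, n))])
    rhs = np.zeros(n + 1)
    rhs[-1] = 1.0
    pi, *_ = np.linalg.lstsq(M, rhs, rcond=None)
    return pi

def poisson_solution(P, f):
    # Returns V with V - P V = f - pi(f) and pi^T V = 0.
    n = P.shape[0]
    pi = stationary_distribution(P)
    centered = f - pi @ f
    # I - P + 1 pi^T is invertible for an irreducible chain, periodic or not.
    Z = np.eye(n) - P + np.outer(np.ones(n), pi)
    return np.linalg.solve(Z, centered)

# Deterministic cycle 0 -> 1 -> 2 -> 0: irreducible with period 3.
P = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0],
              [1.0, 0.0, 0.0]])
f = np.array([1.0, 2.0, 3.0])
V = poisson_solution(P, f)
print("Poisson solution:", V)
```

Here the stationary distribution is uniform, so $\pi(f) = 2$, and solving the three equations by hand gives $V = (-2/3,\ 1/3,\ 1/3)$.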

Stochastic Approximation with Unbounded Markovian Noise: A General-Purpose Theorem

Tight Finite Time Bounds of Two-Time-Scale Linear Stochastic Approximation with Markovian Noise

A Lyapunov Theory for Finite-Sample Guarantees of Markovian Stochastic Approximation

The Curse of Memory in Stochastic Approximation: Extended Version

Finite-Time Error Bounds of Biased Stochastic Approximation With Application to TD-Learning

Stochastic Approximation with Delayed Updates: Finite-Time Rates under Markovian Sampling

The ODE Method for Stochastic Approximation and Reinforcement Learning with Markovian Noise

Asynchronous Stochastic Approximation and Average-Reward Reinforcement Learning

Markovian Foundations for Quasi-Stochastic Approximation with Applications to Extremum Seeking Control

Central Limit Theorem for Two-Timescale Stochastic Approximation with Markovian Noise: Theory and Applications

Markovian Foundations for Quasi-Stochastic Approximation in Two Timescales: Extended Version

A Lyapunov Theory for Finite-Sample Guarantees of Asynchronous Q-Learning and TD-Learning Variants

Optimal and instance-dependent guarantees for Markovian linear stochastic approximation

Stochastic Approximation Beyond Gradient for Signal Processing and Machine Learning

Stochastic approximation in infinite dimensions

Stochastic Approximation for Nonlinear Discrete Stochastic Control: Finite-Sample Bounds

Convergence of Batch Asynchronous Stochastic Approximation With Applications to Reinforcement Learning

Almost Sure Convergence Rates and Concentration of Stochastic Approximation and Reinforcement Learning with Markovian Noise

Finite-Time Error Bounds For Linear Stochastic Approximation and TD Learning

A Stochastic Approximation Framework for a Class of Randomized Optimization Algorithms

Formalization of a Stochastic Approximation Theorem