Reinforcement Learning: Stochastic Approximation Algorithms for Markov Decision Processes

Vikram Krishnamurthy

DOI: https://doi.org/10.48550/arXiv.1512.07669

2015-12-24

Abstract:This article presents a short and concise description of stochastic approximation algorithms in reinforcement learning of Markov decision processes. The algorithms can also be used as a suboptimal method for partially observed Markov decision processes.

Optimization and Control

Asynchronous Stochastic Approximation and Average-Reward Reinforcement Learning

Huizhen Yu,Yi Wan,Richard S. Sutton

2024-09-06

Abstract:This paper studies asynchronous stochastic approximation (SA) algorithms and their application to reinforcement learning in semi-Markov decision processes (SMDPs) with an average-reward criterion. We first extend Borkar and Meyn's stability proof method to accommodate more general noise conditions, leading to broader convergence guarantees for asynchronous SA algorithms. Leveraging these results, we establish the convergence of an asynchronous SA analogue of Schweitzer's classical relative value iteration algorithm, RVI Q-learning, for finite-space, weakly communicating SMDPs. Furthermore, to fully utilize the SA results in this application, we introduce new monotonicity conditions for estimating the optimal reward rate in RVI Q-learning. These conditions substantially expand the previously considered algorithmic framework, and we address them with novel proof arguments in the stability and convergence analysis of RVI Q-learning.

Machine Learning,Optimization and Control
A Tutorial Introduction to Reinforcement Learning

Mathukumalli Vidyasagar

2023-04-03

Abstract:In this paper, we present a brief survey of Reinforcement Learning (RL), with particular emphasis on Stochastic Approximation (SA) as a unifying theme. The scope of the paper includes Markov Reward Processes, Markov Decision Processes, Stochastic Approximation algorithms, and widely used algorithms such as Temporal Difference Learning and $Q$-learning.

Machine Learning,Systems and Control
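
The abstract above names Temporal Difference learning as an instance of stochastic approximation. As a rough illustration only (not taken from the paper), the sketch below runs tabular TD(0) on a hypothetical two-state Markov Reward Process and compares the result with the exact solution of the Bellman equation $V = r + \gamma P V$. The transition matrix `P`, rewards `r`, discount `gamma`, and the step-size exponent 0.7 are all assumptions chosen for the example.

```python
import numpy as np

def td0(P, r, gamma, n_steps, seed=0):
    """Tabular TD(0) on a Markov Reward Process, viewed as stochastic approximation."""
    rng = np.random.default_rng(seed)
    n = len(r)
    V = np.zeros(n)        # value estimates, one per state
    visits = np.zeros(n)   # per-state visit counts used for the step size
    s = 0
    for _ in range(n_steps):
        s_next = rng.choice(n, p=P[s])   # sample the next state from row s of P
        visits[s] += 1
        # Assumed step size: visits^(-0.7); its sum diverges and its squares sum converges.
        alpha = visits[s] ** -0.7
        # Noisy fixed-point update toward r(s) + gamma * V(s').
        V[s] += alpha * (r[s] + gamma * V[s_next] - V[s])
        s = s_next
    return V

# Hypothetical two-state example (values are illustrative, not from any paper).
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])
r = np.array([1.0, 0.0])
gamma = 0.5

V_hat = td0(P, r, gamma, n_steps=50_000)
V_exact = np.linalg.solve(np.eye(2) - gamma * P, r)  # solves (I - gamma P) V = r
print("TD(0) estimate:", V_hat)
print("Exact values:  ", V_exact)
```

The polynomial step size is used here instead of $1/n$ because, with discounting, a $1/n$ schedule can converge very slowly; other schedules satisfying the usual step-size conditions would serve equally well.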
Stochastic Approximation with Unbounded Markovian Noise: A General-Purpose Theorem

Shaan Ul Haque,Siva Theja Maguluri

2024-10-29

Abstract:Motivated by engineering applications such as resource allocation in networks and inventory systems, we consider average-reward Reinforcement Learning with unbounded state space and reward function. Recent works studied this problem in the actor-critic framework and established finite sample bounds assuming access to a critic with certain error guarantees. We complement their work by studying Temporal Difference (TD) learning with linear function approximation and establishing finite-time bounds with the optimal $\mathcal{O}\left(1/\epsilon^2\right)$ sample complexity. These results are obtained using the following general-purpose theorem for non-linear Stochastic Approximation (SA). Suppose that one constructs a Lyapunov function for a non-linear SA with certain drift condition. Then, our theorem establishes finite-time bounds when this SA is driven by unbounded Markovian noise under suitable conditions. It serves as a black box tool to generalize sample guarantees on SA from i.i.d. or martingale difference case to potentially unbounded Markovian noise. The generality and the mild assumption of the setup enables broad applicability of our theorem. We illustrate its power by studying two more systems: (i) We improve upon the finite-time bounds of $Q$-learning by tightening the error bounds and also allowing for a larger class of behavior policies. (ii) We establish the first ever finite-time bounds for distributed stochastic optimization of high-dimensional smooth strongly convex function using cyclic block coordinate descent.

Machine Learning,Systems and Control,Optimization and Control
Markov Abstractions for PAC Reinforcement Learning in Non-Markov Decision Processes

Alessandro Ronca,Gabriel Paludo Licks,Giuseppe De Giacomo

DOI: https://doi.org/10.48550/arXiv.2205.01053

2022-05-18

Abstract:Our work aims at developing reinforcement learning algorithms that do not rely on the Markov assumption. We consider the class of Non-Markov Decision Processes where histories can be abstracted into a finite set of states while preserving the dynamics. We call it a Markov abstraction since it induces a Markov Decision Process over a set of states that encode the non-Markov dynamics. This phenomenon underlies the recently introduced Regular Decision Processes (as well as POMDPs where only a finite number of belief states is reachable). In all such kinds of decision process, an agent that uses a Markov abstraction can rely on the Markov property to achieve optimal behaviour. We show that Markov abstractions can be learned during reinforcement learning. Our approach combines automata learning and classic reinforcement learning. For these two tasks, standard algorithms can be employed. We show that our approach has PAC guarantees when the employed algorithms have PAC guarantees, and we also provide an experimental evaluation.

Machine Learning,Artificial Intelligence
Approximating Euclidean by Imprecise Markov Decision Processes

Manfred Jaeger,Giorgio Bacci,Giovanni Bacci,Kim Guldstrand Larsen,Peter Gjøl Jensen

DOI: https://doi.org/10.48550/arXiv.2006.14923

2020-06-26

Abstract:Euclidean Markov decision processes are a powerful tool for modeling control problems under uncertainty over continuous domains. Finite state imprecise Markov decision processes can be used to approximate the behavior of these infinite models. In this paper we address two questions: first, we investigate what kind of approximation guarantees are obtained when the Euclidean process is approximated by finite state approximations induced by increasingly fine partitions of the continuous state space. We show that for cost functions over finite time horizons the approximations become arbitrarily precise. Second, we use imprecise Markov decision process approximations as a tool to analyse and validate cost functions and strategies obtained by reinforcement learning. We find that, on the one hand, our new theoretical results validate basic design choices of a previously proposed reinforcement learning approach. On the other hand, the imprecise Markov decision process approximations reveal some inaccuracies in the learned cost functions.

Artificial Intelligence
The ODE Method for Stochastic Approximation and Reinforcement Learning with Markovian Noise

Shuze Liu,Shuhang Chen,Shangtong Zhang

2024-07-11

Abstract:Stochastic approximation is a class of algorithms that update a vector iteratively, incrementally, and stochastically, including, e.g., stochastic gradient descent and temporal difference learning. One fundamental challenge in analyzing a stochastic approximation algorithm is to establish its stability, i.e., to show that the stochastic vector iterates are bounded almost surely. In this paper, we extend the celebrated Borkar-Meyn theorem for stability from the Martingale difference noise setting to the Markovian noise setting, which greatly improves its applicability in reinforcement learning, especially in those off-policy reinforcement learning algorithms with linear function approximation and eligibility traces. Central to our analysis is the diminishing asymptotic rate of change of a few functions, which is implied by both a form of strong law of large numbers and a commonly used V4 Lyapunov drift condition and trivially holds if the Markov chain is finite and irreducible.

Machine Learning,Artificial Intelligence
Simultaneously Learning Stochastic and Adversarial Markov Decision Process with Linear Function Approximation

Fang Kong,XiangCheng Zhang,Baoxiang Wang,Shuai Li

2023-01-01

Abstract:Reinforcement learning (RL) has been commonly used in practice. To deal with the numerous states and actions in real applications, the function approximation method has been widely employed to improve the learning efficiency, among which the linear function approximation has attracted great interest both theoretically and empirically. Previous works on the linear Markov Decision Process (MDP) mainly study two settings, the stochastic setting where the reward is generated in a stochastic way and the adversarial setting where the reward can be chosen arbitrarily by an adversary. All these works treat these two environments separately. However, the learning agents often have no idea of how rewards are generated and a wrong reward type can severely disrupt the performance of those specially designed algorithms. So a natural question is whether an algorithm can be derived that can efficiently learn in both environments but without knowing the reward type. In this paper, we first consider such best-of-both-worlds problem for linear MDP with the known transition. We propose an algorithm and prove it can simultaneously achieve $O(\text{poly} \log K)$ regret in the stochastic setting and $O(\sqrt{K})$ regret in the adversarial setting where $K$ is the horizon. To the best of our knowledge, it is the first such result for linear MDP.
Reinforcement learning algorithms for semi-Markov decision processes with average reward

Yanjie Li

DOI: https://doi.org/10.1109/ICNSC.2012.6204909

2012-01-01

Abstract:In this paper, we study reinforcement learning (RL) algorithms based on a perspective of performance sensitivity analysis for SMDPs with average reward. We present the results about performance sensitivity analysis for SMDPs with average reward. On these bases, two RL algorithms for average-reward SMDPs are studied. One algorithm is the relative value iteration (RVI) RL algorithm, which avoids the estimation of optimal average reward in the process of learning. Another algorithm is a policy gradient estimation algorithm, which extends the policy gradient estimation algorithm for discrete time Markov decision processes (MDPs) to SMDPs and only requires half storage of the existing algorithm.
Sharper Model-free Reinforcement Learning for Average-reward Markov Decision Processes

Qiaomin Xie,Zihan Zhang

DOI: https://doi.org/10.48550/arXiv.2306.16394

2023-06-28

Abstract:We develop several provably efficient model-free reinforcement learning (RL) algorithms for infinite-horizon average-reward Markov Decision Processes (MDPs). We consider both online setting and the setting with access to a simulator. In the online setting, we propose model-free RL algorithms based on reference-advantage decomposition. Our algorithm achieves $\widetilde{O}(S^5A^2\mathrm{sp}(h^*)\sqrt{T})$ regret after $T$ steps, where $S\times A$ is the size of state-action space, and $\mathrm{sp}(h^*)$ the span of the optimal bias function. Our results are the first to achieve optimal dependence in $T$ for weakly communicating MDPs. In the simulator setting, we propose a model-free RL algorithm that finds an $\epsilon$-optimal policy using $\widetilde{O} \left(\frac{SA\mathrm{sp}^2(h^*)}{\epsilon^2}+\frac{S^2A\mathrm{sp}(h^*)}{\epsilon} \right)$ samples, whereas the minimax lower bound is $\Omega\left(\frac{SA\mathrm{sp}(h^*)}{\epsilon^2}\right)$. Our results are based on two new techniques that are unique in the average-reward setting: 1) better discounted approximation by value-difference estimation; 2) efficient construction of confidence region for the optimal bias function with space complexity $O(SA)$.

Computer Science
Stochastic Variance-Reduced Policy Gradient

Matteo Papini,Damiano Binaghi,Giuseppe Canonaco,Matteo Pirotta,Marcello Restelli

DOI: https://doi.org/10.48550/arXiv.1806.05618

2018-06-14

Abstract:In this paper, we propose a novel reinforcement-learning algorithm consisting in a stochastic variance-reduced version of policy gradient for solving Markov Decision Processes (MDPs). Stochastic variance-reduced gradient (SVRG) methods have proven to be very successful in supervised learning. However, their adaptation to policy gradient is not straightforward and needs to account for I) a non-concave objective function; II) approximations in the full gradient computation; and III) a non-stationary sampling process. The result is SVRPG, a stochastic variance-reduced policy gradient algorithm that leverages on importance weights to preserve the unbiasedness of the gradient estimate. Under standard assumptions on the MDP, we provide convergence guarantees for SVRPG with a convergence rate that is linear under increasing batch sizes. Finally, we suggest practical variants of SVRPG, and we empirically evaluate them on continuous MDPs.

Machine Learning
Approximate Policy Iteration for Robust Stochastic Control of Multi-agent Markov Decision Processes

Feng Huang,Ming Cao,Long Wang

DOI: https://doi.org/10.1109/tac.2024.3510596

IF: 6.549

2024-01-01

IEEE Transactions on Automatic Control

Abstract:In stochastic dynamic environments, multi-agent Markov decision processes have emerged as a versatile paradigm for studying sequential decision-making problems of fully cooperative multi-agent systems. However, the optimality of the derived policies is usually sensitive to model parameters, which are typically unknown and required to be estimated from noisy data in practice. To investigate the sensitivity of optimal policies to these uncertain parameters, we study a robust stochastic control problem of multi-agent Markov decision processes where all agents constitute a centralized controller whose goal is to seek a maximal long-term return of all agents and the uncertainty plays a role of disturbance for achieving this goal, and provide a solution concept of robust team optimality for decisions of all agents. To seek such a solution, we develop a robust iterative learning algorithm of policies for all agents and present its convergence analysis. This algorithm, compared with robust dynamic programming, not only possesses a faster convergence rate, but also allows for using approximation calculations to alleviate required computational resources. Moreover, some numerical simulations are presented to demonstrate the effectiveness of the algorithm by extending the model of sequential social dilemmas to uncertain scenarios.
Safe Reinforcement Learning for Constrained Markov Decision Processes with Stochastic Stopping Time

Abhijit Mazumdar,Rafal Wisniewski,Manuela L. Bujorianu

2024-03-24

Abstract:In this paper, we present an online reinforcement learning algorithm for constrained Markov decision processes with a safety constraint. Despite the necessary attention of the scientific community, considering stochastic stopping time, the problem of learning optimal policy without violating safety constraints during the learning phase is yet to be addressed. To this end, we propose an algorithm based on linear programming that does not require a process model. We show that the learned policy is safe with high confidence. We also propose a method to compute a safe baseline policy, which is central in developing algorithms that do not violate the safety constraints. Finally, we provide simulation results to show the efficacy of the proposed algorithm. Further, we demonstrate that efficient exploration can be achieved by defining a subset of the state-space called proxy set.

Machine Learning,Optimization and Control
Stochastic Approximation with Delayed Updates: Finite-Time Rates under Markovian Sampling

Arman Adibi,Nicolo Dal Fabbro,Luca Schenato,Sanjeev Kulkarni,H. Vincent Poor,George J. Pappas,Hamed Hassani,Aritra Mitra

2024-03-27

Abstract:Motivated by applications in large-scale and multi-agent reinforcement learning, we study the non-asymptotic performance of stochastic approximation (SA) schemes with delayed updates under Markovian sampling. While the effect of delays has been extensively studied for optimization, the manner in which they interact with the underlying Markov process to shape the finite-time performance of SA remains poorly understood. In this context, our first main contribution is to show that under time-varying bounded delays, the delayed SA update rule guarantees exponentially fast convergence of the \emph{last iterate} to a ball around the SA operator's fixed point. Notably, our bound is \emph{tight} in its dependence on both the maximum delay $\tau_{max}$, and the mixing time $\tau_{mix}$. To achieve this tight bound, we develop a novel inductive proof technique that, unlike various existing delayed-optimization analyses, relies on establishing uniform boundedness of the iterates. As such, our proof may be of independent interest. Next, to mitigate the impact of the maximum delay on the convergence rate, we provide the first finite-time analysis of a delay-adaptive SA scheme under Markovian sampling. In particular, we show that the exponent of convergence of this scheme gets scaled down by $\tau_{avg}$, as opposed to $\tau_{max}$ for the vanilla delayed SA rule; here, $\tau_{avg}$ denotes the average delay across all iterations. Moreover, the adaptive scheme requires no prior knowledge of the delay sequence for step-size tuning. Our theoretical findings shed light on the finite-time effects of delays for a broad class of algorithms, including TD learning, Q-learning, and stochastic gradient descent under Markovian sampling.

Machine Learning,Artificial Intelligence,Multiagent Systems,Systems and Control,Optimization and Control
Stochastic Principal-Agent Problems: Efficient Computation and Learning

Jiarui Gan,Rupak Majumdar,Debmalya Mandal,Goran Radanovic

2024-09-12

Abstract:We introduce a stochastic principal-agent model. A principal and an agent interact in a stochastic environment, each privy to observations about the state not available to the other. The principal has the power of commitment, both to elicit information from the agent and to provide signals about her own information. The players communicate with each other and then select actions independently. Each of them receives a payoff based on the state and their joint action, and the environment transitions to a new state. The interaction continues over a finite time horizon. Both players are far-sighted, aiming to maximize their total payoffs over the time horizon. The model encompasses as special cases extensive-form games (EFGs) and stochastic games of incomplete information, partially observable Markov decision processes (POMDPs), as well as other forms of sequential principal-agent interactions, including Bayesian persuasion and automated mechanism design problems. We consider both the computation and learning of the principal's optimal policy. Since the general problem, which subsumes POMDPs, is intractable, we explore algorithmic solutions under hindsight observability, where the state and the interaction history are revealed at the end of each step. Though the problem becomes more amenable under this condition, the number of possible histories remains exponential in the length of the time horizon, making approaches for EFG-based models infeasible. We present an efficient algorithm based on the inducible value sets. The algorithm computes an $\epsilon$-approximate optimal policy in time polynomial in $1/\epsilon$. Additionally, we show an efficient learning algorithm for an episodic reinforcement learning setting where the transition probabilities are unknown. The algorithm guarantees sublinear regret $\tilde{O}(T^{2/3})$ for both players over $T$ episodes.

Computer Science and Game Theory,Machine Learning,Multiagent Systems
Nonstationary Reinforcement Learning with Linear Function Approximation

Huozhi Zhou,Jinglin Chen,Lav R. Varshney,Ashish Jagmohan

2024-04-13

Abstract:We consider reinforcement learning (RL) in episodic Markov decision processes (MDPs) with linear function approximation under drifting environment. Specifically, both the reward and state transition functions can evolve over time but their total variations do not exceed a $\textit{variation budget}$. We first develop $\texttt{LSVI-UCB-Restart}$ algorithm, an optimistic modification of least-squares value iteration with periodic restart, and bound its dynamic regret when variation budgets are known. Then we propose a parameter-free algorithm $\texttt{Ada-LSVI-UCB-Restart}$ that extends to unknown variation budgets. We also derive the first minimax dynamic regret lower bound for nonstationary linear MDPs and as a byproduct establish a minimax regret lower bound for linear MDPs unsolved by Jin et al. (2020). Finally, we provide numerical experiments to demonstrate the effectiveness of our proposed algorithms.

Machine Learning
Federated Stochastic Approximation under Markov Noise and Heterogeneity: Applications in Reinforcement Learning

Sajad Khodadadian,Pranay Sharma,Gauri Joshi,Siva Theja Maguluri

2024-10-21

Abstract:Since reinforcement learning algorithms are notoriously data-intensive, the task of sampling observations from the environment is usually split across multiple agents. However, transferring these observations from the agents to a central location can be prohibitively expensive in terms of communication cost, and it can also compromise the privacy of each agent's local behavior policy. Federated reinforcement learning is a framework in which $N$ agents collaboratively learn a global model, without sharing their individual data and policies. This global model is the unique fixed point of the average of $N$ local operators, corresponding to the $N$ agents. Each agent maintains a local copy of the global model and updates it using locally sampled data. In this paper, we show that by careful collaboration of the agents in solving this joint fixed point problem, we can find the global model $N$ times faster, also known as linear speedup. We first propose a general framework for federated stochastic approximation with Markovian noise and heterogeneity, showing linear speedup in convergence. We then apply this framework to federated reinforcement learning algorithms, examining the convergence of federated on-policy TD, off-policy TD, and $Q$-learning.

Machine Learning
Actively Learning Reinforcement Learning: A Stochastic Optimal Control Approach

Mohammad S. Ramadan,Mahmoud A. Hayajnh,Michael T. Tolley,Kyriakos G. Vamvoudakis

2024-02-27

Abstract:In this paper we propose a framework towards achieving two intertwined objectives: (i) equipping reinforcement learning with active exploration and deliberate information gathering, such that it regulates state and parameter uncertainties resulting from modeling mismatches and noisy sensory; and (ii) overcoming the huge computational cost of stochastic optimal control. We approach both objectives by using reinforcement learning to attain the stochastic optimal control law. On one hand, we avoid the curse of dimensionality prohibiting the direct solution of the stochastic dynamic programming equation. On the other hand, the resulting stochastic control inspired reinforcement learning agent admits the behavior of a dual control, namely, caution and probing, that is, regulating the state estimate together with its estimation quality. Unlike exploration and exploitation, caution and probing are employed automatically by the controller in real-time, even after the learning process is concluded. We use the proposed approach on a numerical example of a model that belongs to an emerging class in system identification. We show how, for the dimensionality of the stochastic version of this model, Dynamic Programming is prohibitive, Model Predictive Control requires an expensive nonlinear optimization, and a Linear Quadratic Regulator with the certainty equivalence assumption leads to poor performance and filter divergence, all contrasting our approach which is shown to be both: computationally convenient, stabilizing and of an acceptable performance.

Machine Learning,Systems and Control
Nearly Minimax Optimal Reinforcement Learning for Linear Mixture Markov Decision Processes

Dongruo Zhou,Quanquan Gu,Csaba Szepesvari

DOI: https://doi.org/10.48550/arXiv.2012.08507

IF: 5.414

2020-12-15

Machine Learning

Abstract:We study reinforcement learning (RL) with linear function approximation where the underlying transition probability kernel of the Markov decision process (MDP) is a linear mixture model (Jia et al., 2020; Ayoub et al., 2020; Zhou et al., 2020) and the learning agent has access to either an integration or a sampling oracle of the individual basis kernels. We propose a new Bernstein-type concentration inequality for self-normalized martingales for linear bandit problems with bounded noise. Based on the new inequality, we propose a new, computationally efficient algorithm with linear function approximation named $\text{UCRL-VTR}^{+}$ for the aforementioned linear mixture MDPs in the episodic undiscounted setting. We show that $\text{UCRL-VTR}^{+}$ attains an $\tilde O(dH\sqrt{T})$ regret where $d$ is the dimension of feature mapping, $H$ is the length of the episode and $T$ is the number of interactions with the MDP. We also prove a matching lower bound $\Omega(dH\sqrt{T})$ for this setting, which shows that $\text{UCRL-VTR}^{+}$ is minimax optimal up to logarithmic factors. In addition, we propose the $\text{UCLK}^{+}$ algorithm for the same family of MDPs under discounting and show that it attains an $\tilde O(d\sqrt{T}/(1-\gamma)^{1.5})$ regret, where $\gamma\in [0,1)$ is the discount factor. Our upper bound matches the lower bound $\Omega(d\sqrt{T}/(1-\gamma)^{1.5})$ proved by Zhou et al. (2020) up to logarithmic factors, suggesting that $\text{UCLK}^{+}$ is nearly minimax optimal. To the best of our knowledge, these are the first computationally efficient, nearly minimax optimal algorithms for RL with linear function approximation.
Average Reward Reinforcement Learning For Semi-Markov Decision Processes

Jiayuan Yang,Yanjie Li,Haoyao Chen,Jiangang Li

DOI: https://doi.org/10.1007/978-3-319-70087-8_79

2017-01-01

Abstract:In this paper, we study new reinforcement learning (RL) algorithms for Semi-Markov decision processes (SMDPs) with an average reward criterion. Based on the discrete-time type Bellman optimality equation, we use incremental value iteration (IVI), stochastic shortest path (SSP) value iteration and bisection algorithms to derive novel RL algorithms in a straightforward way. These algorithms use IVI, SSP and dichotomy to directly estimate the optimal average reward to solve the instability of average reward RL, respectively. Furthermore, a simulation experiment is used to compare the convergence among these algorithms.
Average Reward Adjusted Discounted Reinforcement Learning: Near-Blackwell-Optimal Policies for Real-World Applications

Manuel Schneckenreither

DOI: https://doi.org/10.48550/arXiv.2004.00857

2020-04-02

Abstract:Although in recent years reinforcement learning has become very popular, the number of successful applications to different kinds of operations research problems is rather scarce. Reinforcement learning is based on the well-studied dynamic programming technique and thus also aims at finding the best stationary policy for a given Markov Decision Process, but in contrast does not require any model knowledge. The policy is assessed solely on consecutive states (or state-action pairs), which are observed while an agent explores the solution space. The contributions of this paper are manifold. First we provide deep theoretical insights to the widely applied standard discounted reinforcement learning framework, which give rise to the understanding of why these algorithms are inappropriate when permanently provided with non-zero rewards, such as costs or profit. Second, we establish a novel near-Blackwell-optimal reinforcement learning algorithm. In contrast to former methods, it assesses the average reward per step separately and thus prevents the incautious combination of different types of state values. Thereby, the Laurent Series expansion of the discounted state values forms the foundation for this development and also provides the connection between the two approaches. Finally, we prove the viability of our algorithm on a challenging problem set, which includes a well-studied M/M/1 admission control queuing system. In contrast to standard discounted reinforcement learning our algorithm infers the optimal policy on all tested problems. The insights are that in the operations research domain machine learning techniques have to be adapted and advanced to successfully apply these methods in our settings.

Machine Learning
