Policy Gradient Methods for Reinforcement Learning with Function Approximation and Action-Dependent Baselines

Philip S. Thomas,Emma Brunskill

DOI: https://doi.org/10.48550/arXiv.1706.06643

2017-06-21

Abstract:We show how an action-dependent baseline can be used with the policy gradient theorem under function approximation, which was originally presented with action-independent baselines by Sutton et al. (2000).

Artificial Intelligence,Machine Learning

Gradient Q(σ, λ): A Unified Algorithm with Function Approximation for Reinforcement Learning

Long Yang,Yu Zhang,Qian Zheng,Pengfei Li,Gang Pan

2019-01-01

Abstract:Full-sampling (e.g., Q-learning) and pure-expectation (e.g., Expected Sarsa) algorithms are efficient and frequently used techniques in reinforcement learning. Q(σ, λ) is the first approach that unifies them with eligibility traces through the sampling degree σ. However, it is limited to the tabular case: for large-scale learning, Q(σ, λ) is too expensive because it requires a huge volume of tables to store value functions accurately. To address this problem, we propose GQ(σ, λ), which extends tabular Q(σ, λ) with linear function approximation. We prove the convergence of GQ(σ, λ). Empirical results on some standard domains show that GQ(σ, λ), with a combination of full-sampling and pure-expectation, reaches better performance than full-sampling and pure-expectation methods.
Optimal Control-Based Baseline for Guided Exploration in Policy Gradient Methods

Xubo Lyu,Site Li,Seth Siriya,Ye Pu,Mo Chen

2024-11-06

Abstract:In this paper, a novel optimal control-based baseline function is presented for the policy gradient method in deep reinforcement learning (RL). The baseline is obtained by computing the value function of an optimal control problem, which is formed to be closely associated with the RL task. In contrast to the traditional baseline aimed at variance reduction of policy gradient estimates, our work utilizes the optimal control value function to introduce a novel aspect to the role of baseline -- providing guided exploration during policy learning. This aspect is less discussed in prior works. We validate our baseline on robot learning tasks, showing its effectiveness in guided exploration, particularly in sparse reward environments.

Machine Learning,Artificial Intelligence,Robotics,Systems and Control
Effects of prenylated isoflavones osajin and pomiferin in premedication on heart ischemia-reperfusion.

T. Florian,J. Nečas,L. Bartošíková,J. Klusáková,V. Suchý,Elmoataz B El Naggara,E. Janoštíková,T. Bartošík

DOI: https://doi.org/10.5507/BP.2006.013

2006-07-01

Abstract:The present 15-day study was undertaken to evaluate the cardioprotective potential of the prenylated isoflavones osajin and pomiferin, isolated from the infructescences of Maclura pomifera (Moraceae), against ischemia-reperfusion-induced injury in rat hearts as a model of antioxidant-based composite therapy. The study was performed on isolated, modified Langendorff-perfused rat hearts, and ischemia of the heart was induced by stopping coronary flow for 30 min, followed by 60 min of reperfusion (14 ml min⁻¹). The Wistar rats were divided into four groups. The first treatment group received osajin (5 mg/kg/day in 0.5% Avicel); the second treatment group received pomiferin (5 mg/kg/day in 0.5% Avicel); the placebo group received only 0.5% Avicel; the last was an untreated control group. A biochemical indicator of oxidative damage (the lipid peroxidation product malondialdehyde), the antioxidant enzymes superoxide dismutase and glutathione peroxidase, and total antioxidant activity in serum and myocardium were evaluated. The effect of osajin and pomiferin on cardiac function (left ventricular end-diastolic pressure, left ventricular pressure, and peak positive +dP/dt) during ischemia and reperfusion was also examined. The results demonstrate that osajin and pomiferin attenuate the myocardial dysfunction provoked by ischemia-reperfusion. This was confirmed by an increase in both antioxidant enzyme values and total antioxidant activity. The cardioprotection provided by osajin and pomiferin treatment results from the suppression of oxidative stress, and this correlates with improved ventricular function.
Variational Policy Gradient Method for Reinforcement Learning with General Utilities

Junyu Zhang,Alec Koppel,Amrit Singh Bedi,Csaba Szepesvari,Mengdi Wang

DOI: https://doi.org/10.48550/arXiv.2007.02151

2020-07-05

Abstract:In recent years, reinforcement learning (RL) systems with general goals beyond a cumulative sum of rewards have gained traction, such as in constrained problems, exploration, and acting upon prior experiences. In this paper, we consider policy optimization in Markov Decision Problems, where the objective is a general concave utility function of the state-action occupancy measure, which subsumes several of the aforementioned examples as special cases. Such generality invalidates the Bellman equation. As this means that dynamic programming no longer works, we focus on direct policy search. Analogously to the Policy Gradient Theorem (Sutton et al., 2000) available for RL with cumulative rewards, we derive a new Variational Policy Gradient Theorem for RL with general utilities, which establishes that the parametrized policy gradient may be obtained as the solution of a stochastic saddle point problem involving the Fenchel dual of the utility function. We develop a variational Monte Carlo gradient estimation algorithm to compute the policy gradient based on sample paths. We prove that the variational policy gradient scheme converges globally to the optimal policy for the general objective, though the optimization problem is nonconvex. We also establish its rate of convergence of the order $O(1/t)$ by exploiting the hidden convexity of the problem, and prove that it converges exponentially when the problem admits hidden strong convexity. Our analysis applies to the standard RL problem with cumulative rewards as a special case, in which case our result improves the available convergence rate.

Machine Learning
On the Convergence of Discounted Policy Gradient Methods

Chris Nota

DOI: https://doi.org/10.48550/arXiv.2212.14066

2023-01-09

Abstract:Many popular policy gradient methods for reinforcement learning follow a biased approximation of the policy gradient known as the discounted approximation. While it has been shown that the discounted approximation of the policy gradient is not the gradient of any objective function, little else is known about its convergence behavior or properties. In this paper, we show that if the discounted approximation is followed such that the discount factor is increased slowly at a rate related to a decreasing learning rate, the resulting method recovers the standard guarantees of gradient ascent on the undiscounted objective.

Machine Learning,Artificial Intelligence
Policy Gradient for Reinforcement Learning with General Utilities

Navdeep Kumar,Kaixin Wang,Kfir Levy,Shie Mannor

2023-08-29

Abstract:In Reinforcement Learning (RL), the goal of agents is to discover an optimal policy that maximizes the expected cumulative rewards. This objective may also be viewed as finding a policy that optimizes a linear function of its state-action occupancy measure, hereafter referred to as Linear RL. However, many supervised and unsupervised RL problems are not covered in the Linear RL framework, such as apprenticeship learning, pure exploration and variational intrinsic control, where the objectives are non-linear functions of the occupancy measures. RL with non-linear utilities looks unwieldy, as methods like the Bellman equation, value iteration, policy gradient, and dynamic programming, which had tremendous success in Linear RL, fail to trivially generalize. In this paper, we derive the policy gradient theorem for RL with general utilities. The policy gradient theorem proves to be a cornerstone in Linear RL due to its elegance and ease of implementability. Our policy gradient theorem for RL with general utilities shares the same elegance and ease of implementability. Based on the policy gradient theorem derived, we also present a simple sample-based algorithm. We believe our results will be of interest to the community and offer inspiration to future works in this generalized setting.

Machine Learning
The Role of Baselines in Policy Gradient Optimization

Jincheng Mei,Wesley Chung,Valentin Thomas,Bo Dai,Csaba Szepesvari,Dale Schuurmans

DOI: https://doi.org/10.48550/arXiv.2301.06276

2023-01-16

Abstract:We study the effect of baselines in on-policy stochastic policy gradient optimization, and close the gap between the theory and practice of policy optimization methods. Our first contribution is to show that the state value baseline allows on-policy stochastic natural policy gradient (NPG) to converge to a globally optimal policy at an $O(1/t)$ rate, which was not previously known. The analysis relies on two novel findings: the expected progress of the NPG update satisfies a stochastic version of the non-uniform Łojasiewicz (NŁ) inequality, and with probability 1 the state value baseline prevents the optimal action's probability from vanishing, thus ensuring sufficient exploration. Importantly, these results provide a new understanding of the role of baselines in stochastic policy gradient: by showing that the variance of natural policy gradient estimates remains unbounded with or without a baseline, we find that variance reduction cannot explain their utility in this setting. Instead, the analysis reveals that the primary effect of the value baseline is to reduce the aggressiveness of the updates rather than their variance. That is, we demonstrate that a finite variance is not necessary for almost sure convergence of stochastic NPG, while controlling update aggressiveness is both necessary and sufficient. Additional experimental results verify these theoretical findings.

Machine Learning,Artificial Intelligence
Generalizable Policy Improvement Via Reinforcement Sampling (student Abstract)

Rui Kong,Chenyang Wu,Zongzhang Zhang

DOI: https://doi.org/10.1609/aaai.v38i21.30466

2024-01-01

Abstract:Current policy gradient techniques excel in refining policies over sampled states but falter when generalizing to unseen states. To address this, we introduce Reinforcement Sampling (RS), a novel method leveraging a generalizable action value function to sample improved decisions. RS is able to improve the decision quality whenever the action value estimation is accurate. It works by improving the agent's decision on the fly on the states the agent is visiting. Compared with the historically experienced states in which conventional policy gradient methods improve the policy, the currently visited states are more relevant to the agent. Our method sufficiently exploits the generalizability of the value function on unseen states and sheds new light on the future development of generalizable reinforcement learning.
A Temporal-Difference Approach to Policy Gradient Estimation

Samuele Tosatto,Andrew Patterson,Martha White,A. Rupam Mahmood

DOI: https://doi.org/10.48550/arXiv.2202.02396

2022-07-08

Abstract:The policy gradient theorem (Sutton et al., 2000) prescribes the usage of a cumulative discounted state distribution under the target policy to approximate the gradient. Most algorithms based on this theorem, in practice, break this assumption, introducing a distribution shift that can cause the convergence to poor solutions. In this paper, we propose a new approach of reconstructing the policy gradient from the start state without requiring a particular sampling strategy. The policy gradient calculation in this form can be simplified in terms of a gradient critic, which can be recursively estimated due to a new Bellman equation of gradients. By using temporal-difference updates of the gradient critic from an off-policy data stream, we develop the first estimator that sidesteps the distribution shift issue in a model-free way. We prove that, under certain realizability conditions, our estimator is unbiased regardless of the sampling strategy. We empirically show that our technique achieves a superior bias-variance trade-off and performance in presence of off-policy samples.

Machine Learning,Artificial Intelligence
A nearly Blackwell-optimal policy gradient method

Vektor Dewanto,Marcus Gallagher

DOI: https://doi.org/10.48550/arXiv.2105.13609

2022-07-03

Abstract:For continuing environments, reinforcement learning (RL) methods commonly maximize the discounted reward criterion with discount factor close to 1 in order to approximate the average reward (the gain). However, such a criterion only considers the long-run steady-state performance, ignoring the transient behaviour in transient states. In this work, we develop a policy gradient method that optimizes the gain, then the bias (which indicates the transient performance and is important to capably select from policies with equal gain). We derive expressions that enable sampling for the gradient of the bias and its preconditioning Fisher matrix. We further devise an algorithm that solves the gain-then-bias (bi-level) optimization. Its key ingredient is an RL-specific logarithmic barrier function. Experimental results provide insights into the fundamental mechanisms of our proposal.

Machine Learning,Artificial Intelligence,Systems and Control
The Reinforce Policy Gradient Algorithm Revisited

Shalabh Bhatnagar

2023-10-08

Abstract:We revisit the Reinforce policy gradient algorithm from the literature. Note that this algorithm typically works with cost returns obtained over random length episodes obtained from either termination upon reaching a goal state (as with episodic tasks) or from instants of visit to a prescribed recurrent state (in the case of continuing tasks). We propose a major enhancement to the basic algorithm. We estimate the policy gradient using a function measurement over a perturbed parameter by appealing to a class of random search approaches. This has advantages in the case of systems with infinite state and action spaces as it relaxes some of the regularity requirements that would otherwise be needed for proving convergence of the Reinforce algorithm. Nonetheless, we observe that even though we estimate the gradient of the performance objective using the performance objective itself (and not via the sample gradient), the algorithm converges to a neighborhood of a local minimum. We also provide a proof of convergence for this new algorithm.

Machine Learning,Artificial Intelligence,Systems and Control,Optimization and Control
Model-free Policy Learning with Reward Gradients

Qingfeng Lan,Samuele Tosatto,Homayoon Farrahi,A. Rupam Mahmood

2023-11-02

Abstract:Despite the increasing popularity of policy gradient methods, they are yet to be widely utilized in sample-scarce applications, such as robotics. The sample efficiency could be improved by making best usage of available information. As a key component in reinforcement learning, the reward function is usually devised carefully to guide the agent. Hence, the reward function is usually known, allowing access to not only scalar reward signals but also reward gradients. To benefit from reward gradients, previous works require the knowledge of environment dynamics, which are hard to obtain. In this work, we develop the Reward Policy Gradient estimator, a novel approach that integrates reward gradients without learning a model. Bypassing the model dynamics allows our estimator to achieve a better bias-variance trade-off, which results in a higher sample efficiency, as shown in the empirical analysis. Our method also boosts the performance of Proximal Policy Optimization on different MuJoCo control tasks.

Machine Learning,Artificial Intelligence
Compatible Gradient Approximations for Actor-Critic Algorithms

Baturay Saglam,Dionysis Kalogerias

2024-09-03

Abstract:Deterministic policy gradient algorithms are foundational for actor-critic methods in controlling continuous systems, yet they often encounter inaccuracies due to their dependence on the derivative of the critic's value estimates with respect to input actions. This reliance requires precise action-value gradient computations, a task that proves challenging under function approximation. We introduce an actor-critic algorithm that bypasses the need for such precision by employing a zeroth-order approximation of the action-value gradient through two-point stochastic gradient estimation within the action space. This approach provably and effectively addresses compatibility issues inherent in deterministic policy gradient schemes. Empirical results further demonstrate that our algorithm not only matches but frequently exceeds the performance of current state-of-the-art methods.

Machine Learning
Matrix Low-Rank Approximation For Policy Gradient Methods

Sergio Rozada,Antonio G. Marques

2024-05-28

Abstract:Estimating a policy that maps states to actions is a central problem in reinforcement learning. Traditionally, policies are inferred from the so called value functions (VFs), but exact VF computation suffers from the curse of dimensionality. Policy gradient (PG) methods bypass this by learning directly a parametric stochastic policy. Typically, the parameters of the policy are estimated using neural networks (NNs) tuned via stochastic gradient descent. However, finding adequate NN architectures can be challenging, and convergence issues are common as well. In this paper, we put forth low-rank matrix-based models to estimate efficiently the parameters of PG algorithms. We collect the parameters of the stochastic policy into a matrix, and then, we leverage matrix-completion techniques to promote (enforce) low rank. We demonstrate via numerical studies how low-rank matrix-based policy models reduce the computational and sample complexities relative to NN models, while achieving a similar aggregated reward.

Machine Learning,Artificial Intelligence
Approximation Benefits of Policy Gradient Methods with Aggregated States

Daniel Russo

DOI: https://doi.org/10.1287/mnsc.2023.4788

IF: 5.4

2023-06-29

Management Science

Abstract:Folklore suggests that policy gradient can be more robust to misspecification than its relative, approximate policy iteration. This paper studies the case of state-aggregated representations, in which the state space is partitioned and either the policy or value function approximation is held constant over partitions. This paper shows a policy gradient method converges to a policy whose regret per period is bounded by ε, the largest difference between two elements of the state-action value function belonging to a common partition. With the same representation, both approximate policy iteration and approximate value iteration can produce policies whose per-period regret scales as [Formula: see text], where γ is a discount factor. Faced with inherent approximation error, methods that locally optimize the true decision objective can be far more robust. This paper was accepted by Hamid Nazerzadeh, data science. Supplemental Material: Data are available at https://doi.org/10.1287/mnsc.2023.4788 .

management,operations research & management science
Beyond Expected Returns: A Policy Gradient Algorithm for Cumulative Prospect Theoretic Reinforcement Learning

Olivier Lepel,Anas Barakat

2024-10-03

Abstract:The widely used expected utility theory has been shown to be empirically inconsistent with human preferences in the psychology and behavioral economics literature. Cumulative Prospect Theory (CPT) has been developed to fill this gap and provide a better model for human-based decision-making supported by empirical evidence. It can express a wide range of attitudes and perceptions towards risk, gains and losses. A few years ago, CPT was combined with Reinforcement Learning (RL) to formulate a CPT policy optimization problem where the goal of the agent is to search for a policy generating long-term returns which are aligned with their preferences. In this work, we revisit this policy optimization problem and provide new insights on optimal policies and their nature depending on the utility function under consideration. We further derive a novel policy gradient theorem for the CPT policy optimization objective generalizing the seminal corresponding result in standard RL. This result enables us to design a model-free policy gradient algorithm to solve the CPT-RL problem. We illustrate the performance of our algorithm in simple examples motivated by traffic control and electricity management applications. We also demonstrate that our policy gradient algorithm scales better to larger state spaces compared to the existing zeroth order algorithm for solving the same problem.

Machine Learning,Artificial Intelligence
A policy gradient approach for Finite Horizon Constrained Markov Decision Processes

Soumyajit Guin,Shalabh Bhatnagar

DOI: https://doi.org/10.1109/CDC49753.2023.10383413

2024-10-14

Abstract:The infinite horizon setting is widely adopted for problems of reinforcement learning (RL). These invariably result in stationary policies that are optimal. In many situations, finite horizon control problems are of interest and for such problems, the optimal policies are time-varying in general. Another setting that has become popular in recent times is of Constrained Reinforcement Learning, where the agent maximizes its rewards while it also aims to satisfy some given constraint criteria. However, this setting has only been studied in the context of infinite horizon MDPs where stationary policies are optimal. We present an algorithm for constrained RL in the Finite Horizon Setting where the horizon terminates after a fixed (finite) time. We use function approximation in our algorithm which is essential when the state and action spaces are large or continuous and use the policy gradient method to find the optimal policy. The optimal policy that we obtain depends on the stage and so is non-stationary in general. To the best of our knowledge, our paper presents the first policy gradient algorithm for the finite horizon setting with constraints. We show the convergence of our algorithm to a constrained optimal policy. We also compare and analyze the performance of our algorithm through experiments and show that our algorithm performs better than some other well known algorithms.

Machine Learning
Smoothed functional-based gradient algorithms for off-policy reinforcement learning: A non-asymptotic viewpoint

Nithia Vijayan,Prashanth L. A

2024-06-24

Abstract:We propose two policy gradient algorithms for solving the problem of control in an off-policy reinforcement learning (RL) context. Both algorithms incorporate a smoothed functional (SF) based gradient estimation scheme. The first algorithm is a straightforward combination of importance sampling-based off-policy evaluation with SF-based gradient estimation. The second algorithm, inspired by the stochastic variance-reduced gradient (SVRG) algorithm, incorporates variance reduction in the update iteration. For both algorithms, we derive non-asymptotic bounds that establish convergence to an approximate stationary point. From these results, we infer that the first algorithm converges at a rate that is comparable to the well-known REINFORCE algorithm in an off-policy RL context, while the second algorithm exhibits an improved rate of convergence.

Machine Learning
Global Convergence of Policy Gradient Methods in Reinforcement Learning, Games and Control

Shicong Cen,Yuejie Chi

2023-10-09

Abstract:Policy gradient methods, where one searches for the policy of interest by maximizing the value functions using first-order information, have become increasingly popular for sequential decision making in reinforcement learning, games, and control. Guaranteeing the global optimality of policy gradient methods, however, is highly nontrivial due to nonconcavity of the value functions. In this exposition, we highlight recent progress in understanding and developing policy gradient methods with global convergence guarantees, putting an emphasis on their finite-time convergence rates with regard to salient problem parameters.

Optimization and Control,Computer Science and Game Theory,Information Theory,Machine Learning
Policy Gradient Method For Robust Reinforcement Learning

Yue Wang,Shaofeng Zou

DOI: https://doi.org/10.48550/arXiv.2205.07344

IF: 5.414

2022-05-15

Machine Learning

Abstract:This paper develops the first policy gradient method with global optimality guarantee and complexity analysis for robust reinforcement learning under model mismatch. Robust reinforcement learning is to learn a policy robust to model mismatch between simulator and real environment. We first develop the robust policy (sub-)gradient, which is applicable for any differentiable parametric policy class. We show that the proposed robust policy gradient method converges to the global optimum asymptotically under direct policy parameterization. We further develop a smoothed robust policy gradient method and show that to achieve an $\epsilon$-global optimum, the complexity is $\mathcal O(\epsilon^{-3})$. We then extend our methodology to the general model-free setting and design the robust actor-critic method with differentiable parametric policy class and value function. We further characterize its asymptotic convergence and sample complexity under the tabular setting. Finally, we provide simulation results to demonstrate the robustness of our methods.
