A Batch, Off-Policy, Actor-Critic Algorithm for Optimizing the Average Reward

S.A. Murphy,Y. Deng,E.B. Laber,H.R. Maei,R.S. Sutton,K. Witkiewitz

DOI: https://doi.org/10.48550/arXiv.1607.05047

2016-07-18

Abstract:We develop an off-policy actor-critic algorithm for learning an optimal policy from a training set composed of data from multiple individuals. This algorithm is developed with a view towards its use in mobile health.

Machine Learning

Beyond Reward: Offline Preference-guided Policy Optimization

Yachen Kang,Diyuan Shi,Jinxin Liu,Li He,Donglin Wang

DOI: https://doi.org/10.48550/arxiv.2305.16217

2023-01-01

Abstract:This study focuses on the topic of offline preference-based reinforcement learning (PbRL), a variant of conventional reinforcement learning that dispenses with the need for online interaction or specification of reward functions. Instead, the agent is provided with fixed offline trajectories and human preferences between pairs of trajectories to extract the dynamics and task information, respectively. Since the dynamics and task information are orthogonal, a naive approach would involve using preference-based reward learning followed by an off-the-shelf offline RL algorithm. However, this requires the separate learning of a scalar reward function, which is assumed to be an information bottleneck of the learning process. To address this issue, we propose the offline preference-guided policy optimization (OPPO) paradigm, which models offline trajectories and preferences in a one-step process, eliminating the need for separately learning a reward function. OPPO achieves this by introducing an offline hindsight information matching objective for optimizing a contextual policy and a preference modeling objective for finding the optimal context. OPPO further integrates a well-performing decision policy by optimizing the two objectives iteratively. Our empirical results demonstrate that OPPO effectively models offline preferences and outperforms prior competing baselines, including offline RL algorithms performed over either true or pseudo reward function specifications. Our code is available on the project website: https://sites.google.com/view/oppo-icml-2023 .
Behavior Proximal Policy Optimization

Zifeng Zhuang,Kun LEI,Jinxin Liu,Donglin Wang,Yilang Guo

DOI: https://doi.org/10.48550/arxiv.2302.11312

2023-01-01

Abstract:Offline reinforcement learning (RL) is a challenging setting where existing off-policy actor-critic methods perform poorly due to the overestimation of out-of-distribution actions. Thus, various additional augmentations are proposed to keep the learned policy close to the offline dataset (or behavior policy). In this work, starting from the analysis of offline monotonic policy improvement, we get a surprising finding that some online on-policy algorithms are naturally able to solve offline RL. Specifically, the inherent conservatism of these on-policy algorithms is exactly what the offline RL method needs to accomplish the closeness. Based on this, we design an algorithm called Behavior Proximal Policy Optimization (BPPO), which successfully solves offline RL without any extra constraint or regularization introduced. Extensive experiments on the D4RL benchmark indicate this extremely succinct method outperforms state-of-the-art offline RL algorithms.
Off-Policy Average Reward Actor-Critic with Deterministic Policy Search

Naman Saxena,Subhojyoti Khastigir,Shishir Kolathaya,Shalabh Bhatnagar

2023-07-19

Abstract:The average reward criterion is relatively less studied as most existing works in the Reinforcement Learning literature consider the discounted reward criterion. There are few recent works that present on-policy average reward actor-critic algorithms, but average reward off-policy actor-critic is relatively less explored. In this work, we present both on-policy and off-policy deterministic policy gradient theorems for the average reward performance criterion. Using these theorems, we also present an Average Reward Off-Policy Deep Deterministic Policy Gradient (ARO-DDPG) Algorithm. We first show asymptotic convergence analysis using the ODE-based method. Subsequently, we provide a finite time analysis of the resulting stochastic approximation scheme with linear function approximator and obtain an $\epsilon$-optimal stationary policy with a sample complexity of $\Omega(\epsilon^{-2.5})$. We compare the average reward performance of our proposed ARO-DDPG algorithm and observe better empirical performance compared to state-of-the-art on-policy average reward actor-critic algorithms over MuJoCo-based environments.

Machine Learning,Artificial Intelligence
BATCH POLICY LEARNING IN AVERAGE REWARD MARKOV DECISION PROCESSES

Peng Liao,Zhengling Qi,Runzhe Wan,Predrag Klasnja,Susan A Murphy

DOI: https://doi.org/10.1214/22-aos2231

Abstract:We consider the batch (off-line) policy learning problem in the infinite horizon Markov Decision Process. Motivated by mobile health applications, we focus on learning a policy that maximizes the long-term average reward. We propose a doubly robust estimator for the average reward and show that it achieves semiparametric efficiency. Further we develop an optimization algorithm to compute the optimal policy in a parameterized stochastic policy class. The performance of the estimated policy is measured by the difference between the optimal average reward in the policy class and the average reward of the estimated policy and we establish a finite-sample regret guarantee. The performance of the method is illustrated by simulation studies and an analysis of a mobile health study promoting physical activity.
Optimal Actor-Critic Policy With Optimized Training Datasets

Chayan Banerjee,Zhiyong Chen,Nasimul Noman,Mohsen Zamani

DOI: https://doi.org/10.1109/tetci.2022.3140375

2022-01-01

IEEE Transactions on Emerging Topics in Computational Intelligence

Abstract:Actor-critic (AC) algorithms are known for their efficacy and high performance in solving reinforcement learning problems, but they also suffer from low sampling efficiency. An AC-based policy optimization process is iterative and needs to access the agent-environment to evaluate and update the policy by rolling out the policy, collecting rewards and states (i.e. samples), and learning from them. It ultimately requires a huge number of samples to learn an optimal policy. To improve sampling efficiency, we propose a strategy to optimize the training dataset that contains significantly fewer samples collected from the AC process. The dataset optimization is made of a best episode only operation, a policy parameter-fitness model, and a genetic algorithm module. The optimal policy network trained by the optimized training dataset exhibits superior performance compared to many contemporary AC algorithms in controlling autonomous dynamical systems. Evaluation on standard benchmarks shows that the method improves sampling efficiency, ensures faster convergence to optima, and is more data-efficient than its counterparts.
Importance Weighted Actor-Critic for Optimal Conservative Offline Reinforcement Learning

Hanlin Zhu,Paria Rashidinejad,Jiantao Jiao

2023-10-09

Abstract:We propose A-Crab (Actor-Critic Regularized by Average Bellman error), a new practical algorithm for offline reinforcement learning (RL) in complex environments with insufficient data coverage. Our algorithm combines the marginalized importance sampling framework with the actor-critic paradigm, where the critic returns evaluations of the actor (policy) that are pessimistic relative to the offline data and have a small average (importance-weighted) Bellman error. Compared to existing methods, our algorithm simultaneously offers a number of advantages: (1) It achieves the optimal statistical rate of $1/\sqrt{N}$ -- where $N$ is the size of offline dataset -- in converging to the best policy covered in the offline dataset, even when combined with general function approximators. (2) It relies on a weaker average notion of policy coverage (compared to the $\ell_\infty$ single-policy concentrability) that exploits the structure of policy visitations. (3) It outperforms the data-collection behavior policy over a wide range of specific hyperparameters. We provide both theoretical analysis and experimental results to validate the effectiveness of our proposed algorithm.

Machine Learning
Order-Optimal Global Convergence for Average Reward Reinforcement Learning via Actor-Critic Approach

Swetha Ganesh,Washim Uddin Mondal,Vaneet Aggarwal

2024-10-22

Abstract:This work analyzes average-reward reinforcement learning with general parametrization. Current state-of-the-art (SOTA) guarantees for this problem are either suboptimal or demand prior knowledge of the mixing time of the underlying Markov process, which is unavailable in most practical scenarios. We introduce a Multi-level Monte Carlo-based Natural Actor-Critic (MLMC-NAC) algorithm to address these issues. Our approach is the first to achieve a global convergence rate of $\tilde{\mathcal{O}}(1/\sqrt{T})$ without needing the knowledge of mixing time. It significantly surpasses the SOTA bound of $\tilde{\mathcal{O}}(T^{-1/4})$ where $T$ is the horizon length.

Machine Learning
Off-Policy Neural Fitted Actor-Critic

Matthieu Zimmer,Yann Boniface,Alain Dutech

2016-01-01

Abstract:A new off-policy, offline, model-free, actor-critic reinforcement learning algorithm dealing with continuous environments in both states and actions is presented. It addresses discrete time problems where the goal is to maximize the discounted sum of rewards using stationary policies. Our algorithm allows a trade-off between data-efficiency and scalability. The amount of a priori knowledge is kept low by: (1) using neural networks to learn both the critic and the actor, (2) not relying on initial trajectories provided by an expert, and (3) not depending on known goal states. Experimental results compare data-efficiency to 4 state-of-the-art algorithms on three benchmark environments. This article largely reproduces a previous work [34] by adding a higher dimensional environment, improving control architectures, and providing batch normalization for other state-of-the-art algorithms.
Variance-Constrained Actor-Critic Algorithms for Discounted and Average Reward MDPs

Prashanth L.A.,Mohammad Ghavamzadeh

DOI: https://doi.org/10.48550/arXiv.1403.6530

2015-03-18

Abstract:In many sequential decision-making problems we may want to manage risk by minimizing some measure of variability in rewards in addition to maximizing a standard criterion. Variance related risk measures are among the most common risk-sensitive criteria in finance and operations research. However, optimizing many such criteria is known to be a hard problem. In this paper, we consider both discounted and average reward Markov decision processes. For each formulation, we first define a measure of variability for a policy, which in turn gives us a set of risk-sensitive criteria to optimize. For each of these criteria, we derive a formula for computing its gradient. We then devise actor-critic algorithms that operate on three timescales - a TD critic on the fastest timescale, a policy gradient (actor) on the intermediate timescale, and a dual ascent for Lagrange multipliers on the slowest timescale. In the discounted setting, we point out the difficulty in estimating the gradient of the variance of the return and incorporate simultaneous perturbation approaches to alleviate this. The average setting, on the other hand, allows for an actor update using compatible features to estimate the gradient of the variance. We establish the convergence of our algorithms to locally risk-sensitive optimal policies. Finally, we demonstrate the usefulness of our algorithms in a traffic signal control application.

Machine Learning,Optimization and Control
Offline-Boosted Actor-Critic: Adaptively Blending Optimal Historical Behaviors in Deep Off-Policy RL

Yu Luo,Tianying Ji,Fuchun Sun,Jianwei Zhang,Huazhe Xu,Xianyuan Zhan

2024-05-29

Abstract:Off-policy reinforcement learning (RL) has achieved notable success in tackling many complex real-world tasks, by leveraging previously collected data for policy learning. However, most existing off-policy RL algorithms fail to maximally exploit the information in the replay buffer, limiting sample efficiency and policy performance. In this work, we discover that concurrently training an offline RL policy based on the shared online replay buffer can sometimes outperform the original online learning policy, though the occurrence of such performance gains remains uncertain. This motivates a new possibility of harnessing the emergent outperforming offline optimal policy to improve online policy learning. Based on this insight, we present Offline-Boosted Actor-Critic (OBAC), a model-free online RL framework that elegantly identifies the outperforming offline policy through value comparison, and uses it as an adaptive constraint to guarantee stronger policy learning performance. Our experiments demonstrate that OBAC outperforms other popular model-free RL baselines and rivals advanced model-based RL methods in terms of sample efficiency and asymptotic performance across 53 tasks spanning 6 task suites.

Machine Learning,Artificial Intelligence
Hierarchical Average Reward Policy Gradient Algorithms

Akshay Dharmavaram,Matthew Riemer,Shalabh Bhatnagar

DOI: https://doi.org/10.48550/arXiv.1911.08826

2019-11-20

Abstract:Option-critic learning is a general-purpose reinforcement learning (RL) framework that aims to address the issue of long term credit assignment by leveraging temporal abstractions. However, when dealing with extended timescales, discounting future rewards can lead to incorrect credit assignments. In this work, we address this issue by extending the hierarchical option-critic policy gradient theorem for the average reward criterion. Our proposed framework aims to maximize the long-term reward obtained in the steady-state of the Markov chain defined by the agent's policy. Furthermore, we use an ordinary differential equation based approach for our convergence analysis and prove that the parameters of the intra-option policies, termination functions, and value functions, converge to their corresponding optimal values, with probability one. Finally, we illustrate the competitive advantage of learning options, in the average reward setting, on a grid-world environment with sparse rewards.

Machine Learning,Artificial Intelligence
Distillation Policy Optimization

Jianfei Ma

2023-09-27

Abstract:While on-policy algorithms are known for their stability, they often demand a substantial number of samples. In contrast, off-policy algorithms, which leverage past experiences, are considered sample-efficient but tend to exhibit instability. Can we develop an algorithm that harnesses the benefits of off-policy data while maintaining stable learning? In this paper, we introduce an actor-critic learning framework that harmonizes two data sources for both evaluation and control, facilitating rapid learning and adaptable integration with on-policy algorithms. This framework incorporates variance reduction mechanisms, including a unified advantage estimator (UAE) and a residual baseline, improving the efficacy of both on- and off-policy learning. Our empirical results showcase substantial enhancements in sample efficiency for on-policy algorithms, effectively bridging the gap to the off-policy approaches. It demonstrates the promise of our approach as a novel learning paradigm.

Machine Learning,Artificial Intelligence
Neural Network Compatible Off-Policy Natural Actor-Critic Algorithm

Raghuram Bharadwaj Diddigi,Prateek Jain,Prabuchandran K.J.,Shalabh Bhatnagar

DOI: https://doi.org/10.48550/arXiv.2110.10017

2022-06-15

Abstract:Learning optimal behavior from existing data is one of the most important problems in Reinforcement Learning (RL). This is known as "off-policy control" in RL where an agent's objective is to compute an optimal policy based on the data obtained from the given policy (known as the behavior policy). As the optimal policy can be very different from the behavior policy, learning optimal behavior is very hard in the "off-policy" setting compared to the "on-policy" setting where new data from the policy updates will be utilized in learning. This work proposes an off-policy natural actor-critic algorithm that utilizes state-action distribution correction for handling the off-policy behavior and the natural policy gradient for sample efficiency. The existing natural gradient-based actor-critic algorithms with convergence guarantees require fixed features for approximating both policy and value functions. This often leads to sub-optimal learning in many RL applications. On the other hand, our proposed algorithm utilizes compatible features that enable one to use arbitrary neural networks to approximate the policy and the value function and guarantee convergence to a locally optimal policy. We illustrate the benefit of the proposed off-policy natural gradient algorithm by comparing it with the vanilla gradient actor-critic algorithm on benchmark RL tasks.

Machine Learning,Artificial Intelligence
On-Policy Deep Reinforcement Learning for the Average-Reward Criterion

Yiming Zhang,Keith W. Ross

DOI: https://doi.org/10.48550/arXiv.2106.07329

2021-06-14

Abstract:We develop theory and algorithms for average-reward on-policy Reinforcement Learning (RL). We first consider bounding the difference of the long-term average reward for two policies. We show that previous work based on the discounted return (Schulman et al., 2015; Achiam et al., 2017) results in a non-meaningful bound in the average-reward setting. By addressing the average-reward criterion directly, we then derive a novel bound which depends on the average divergence between the two policies and Kemeny's constant. Based on this bound, we develop an iterative procedure which produces a sequence of monotonically improved policies for the average reward criterion. This iterative procedure can then be combined with classic DRL (Deep Reinforcement Learning) methods, resulting in practical DRL algorithms that target the long-run average reward criterion. In particular, we demonstrate that Average-Reward TRPO (ATRPO), which adapts the on-policy TRPO algorithm to the average-reward criterion, significantly outperforms TRPO in the most challenging MuJoCo environments.

Machine Learning,Artificial Intelligence
Toward a data efficient neural actor-critic

Matthieu Zimmer,Yann Boniface,Alain Dutech

2016-01-01

Abstract:A new off-policy, offline, model-free, actor-critic reinforcement learning algorithm dealing with continuous environments in both states and actions is presented. It addresses discrete time problems where the goal is to maximize the discounted sum of rewards using stationary policies. Our algorithm allows a trade-off between data-efficiency and scalability. The amount of a priori knowledge is kept low by: (1) using neural networks to learn both the critic and the actor, (2) not relying on initial trajectories provided by an expert, and (3) not depending on known goal states. Experimental results show better data-efficiency than 4 state-of-the-art algorithms on two benchmark environments.
Off-OAB: Off-Policy Policy Gradient Method with Optimal Action-Dependent Baseline

Wenjia Meng,Qian Zheng,Long Yang,Yilong Yin,Gang Pan

2024-05-04

Abstract:Policy-based methods have achieved remarkable success in solving challenging reinforcement learning problems. Among these methods, off-policy policy gradient methods are particularly important because they can benefit from off-policy data. However, these methods suffer from the high variance of the off-policy policy gradient (OPPG) estimator, which results in poor sample efficiency during training. In this paper, we propose an off-policy policy gradient method with the optimal action-dependent baseline (Off-OAB) to mitigate this variance issue. Specifically, this baseline maintains the OPPG estimator's unbiasedness while theoretically minimizing its variance. To enhance practical computational efficiency, we design an approximated version of this optimal baseline. Utilizing this approximation, our method (Off-OAB) aims to decrease the OPPG estimator's variance during policy optimization. We evaluate the proposed Off-OAB method on six representative tasks from OpenAI Gym and MuJoCo, where it demonstrably surpasses state-of-the-art methods on the majority of these tasks.

Machine Learning,Artificial Intelligence
The Actor-Advisor: Policy Gradient With Off-Policy Advice

Hélène Plisnier,Denis Steckelmacher,Diederik M. Roijers,Ann Nowé

DOI: https://doi.org/10.48550/arXiv.1902.02556

2019-02-07

Abstract:Actor-critic algorithms learn an explicit policy (actor), and an accompanying value function (critic). The actor performs actions in the environment, while the critic evaluates the actor's current policy. However, despite their stability and promising convergence properties, current actor-critic algorithms do not outperform critic-only ones in practice. We believe that the fact that the critic learns Q^pi, instead of the optimal Q-function Q*, prevents state-of-the-art robust and sample-efficient off-policy learning algorithms from being used. In this paper, we propose an elegant solution, the Actor-Advisor architecture, in which a Policy Gradient actor learns from unbiased Monte-Carlo returns, while being shaped (or advised) by the Softmax policy arising from an off-policy critic. The critic can be learned independently from the actor, using any state-of-the-art algorithm. Being advised by a high-quality critic, the actor quickly and robustly learns the task, while its use of the Monte-Carlo return helps overcome any bias the critic may have. In addition to a new Actor-Critic formulation, the Actor-Advisor, a method that allows an external advisory policy to shape a Policy Gradient actor, can be applied to many other domains. By varying the source of advice, we demonstrate the wide applicability of the Actor-Advisor to three other important subfields of RL: safe RL with backup policies, efficient leverage of domain knowledge, and transfer learning in RL. Our experimental results demonstrate the benefits of the Actor-Advisor compared to state-of-the-art actor-critic methods, illustrate its applicability to the three other application scenarios listed above, and show that many important challenges of RL can now be solved using a single elegant solution.

Artificial Intelligence
ReLU to the Rescue: Improve Your On-Policy Actor-Critic with Positive Advantages

Andrew Jesson,Chris Lu,Gunshi Gupta,Nicolas Beltran-Velez,Angelos Filos,Jakob Nicolaus Foerster,Yarin Gal

2024-10-10

Abstract:This paper proposes a step toward approximate Bayesian inference in on-policy actor-critic deep reinforcement learning. It is implemented through three changes to the Asynchronous Advantage Actor-Critic (A3C) algorithm: (1) applying a ReLU function to advantage estimates, (2) spectral normalization of actor-critic weights, and (3) incorporating dropout as a Bayesian approximation. We prove under standard assumptions that restricting policy updates to positive advantages optimizes for value by maximizing a lower bound on the value function plus an additive term. We show that the additive term is bounded proportional to the Lipschitz constant of the value function, which offers theoretical grounding for spectral normalization of critic weights. Finally, our application of dropout corresponds to approximate Bayesian inference over both the actor and critic parameters, which enables adaptive state-aware exploration around the modes of the actor via Thompson sampling. We demonstrate significant improvements for median and interquartile mean metrics over A3C, PPO, SAC, and TD3 on the MuJoCo continuous control benchmark and improvement over PPO in the challenging ProcGen generalization benchmark.

Machine Learning
Multi-agent Gradient-Based Off-Policy Actor-Critic Algorithm for Distributed Reinforcement Learning

Jineng Ren

DOI: https://doi.org/10.1007/s44196-024-00560-2

IF: 2.259

2024-06-26

International Journal of Computational Intelligence Systems

Abstract:This paper proposes a gradient-based multi-agent actor-critic algorithm for off-policy reinforcement learning using importance sampling. Our algorithm is incremental with full gradients, and its complexity per iteration scales linearly with the size of approximation features. Previous multi-agent actor-critic algorithms are limited to the on-policy setting or off-policy emphatic temporal difference (TD) learning and they do not take advantage of the advances in off-policy gradient temporal difference learning (GTD). As a theoretical contribution, we establish that the critic step of the proposed algorithm converges to the TD solution of the projected Bellman equation and the actor step converges to the set of asymptotically stable fixed points. Numerical experiments on the multi-agent generalization of Boyan's chain problem show that the proposed approach provides improved performance in terms of stability and convergence rate compared with the state-of-the-art baseline algorithm.

computer science, artificial intelligence, interdisciplinary applications
An Approximate Policy Iteration Viewpoint of Actor-Critic Algorithms

Zaiwei Chen,Siva Theja Maguluri

DOI: https://doi.org/10.48550/arXiv.2208.03247

2023-01-13

Abstract:In this work, we consider policy-based methods for solving the reinforcement learning problem, and establish the sample complexity guarantees. A policy-based algorithm typically consists of an actor and a critic. We consider using various policy update rules for the actor, including the celebrated natural policy gradient. In contrast to the gradient ascent approach taken in the literature, we view natural policy gradient as an approximate way of implementing policy iteration, and show that natural policy gradient (without any regularization) enjoys geometric convergence when using increasing stepsizes. As for the critic, we consider using TD-learning with linear function approximation and off-policy sampling. Since it is well-known that in this setting TD-learning can be unstable, we propose a stable generic algorithm (including two specific algorithms: the $\lambda$-averaged $Q$-trace and the two-sided $Q$-trace) that uses multi-step return and generalized importance sampling factors, and provide the finite-sample analysis. Combining the geometric convergence of the actor with the finite-sample analysis of the critic, we establish for the first time an overall $\mathcal{O}(\epsilon^{-2})$ sample complexity for finding an optimal policy (up to a function approximation error) using policy-based methods under off-policy sampling and linear function approximation.

Machine Learning
